Abstract

We present techniques to parallelize membership tests for Deterministic Finite Automata
(DFAs). Our method searches arbitrary regular expressions by matching multiple bytes in parallel
using speculation. We partition the input string into chunks, match chunks in parallel, and combine
the matching results. Our parallel matching algorithm exploits structural DFA properties to minimize
the speculative overhead. Unlike previous approaches, our speculation is failure-free, i.e., (1) sequential
semantics are maintained, and (2) speed-downs are avoided altogether. On architectures with a SIMD
gather-operation for indexed memory loads, our matching operation is fully vectorized. The proposed
load-balancing scheme uses an off-line profiling step to determine the matching capacity of each
participating processor. Based on matching capacities, DFA matches are load-balanced on inhomogeneous
parallel architectures such as cloud computing environments.

We evaluated our speculative DFA membership test for a representative set of benchmarks from
the Perl-compatible Regular Expression (PCRE) library [35] and the PROSITE [36] protein database.
Evaluation was conducted on a 4 CPU (40 cores) shared-memory node of the Intel Manycore Testing
Lab (Intel MTL), on the Intel AVX2 SDE simulator for 8-way fully vectorized SIMD execution, and on
a 20-node (288 cores) cluster on the Amazon EC2 computing cloud. Obtained speedups are on the order
of $O\left(1 + \frac{|P|-1}{|Q| \cdot \gamma}\right)$, where $|P|$ denotes the number of processors or SIMD units, $|Q|$ denotes the number of
DFA states, and $0 < \gamma \le 1$ represents a statically computed DFA property. For all observed cases, we
found that $0.16 < \gamma < 0.47$. Actual speedups range from 1.6x to 38.2x for up to 512 states for PCRE,
and between 1.2x and 13.9x for up to 766 states for PROSITE on a 40-core MTL node. Not taking
communication costs into account, speedups on the EC2 computing cloud range from 5.2x to 173.9x
for PCRE, and from 2.2x to 98x for PROSITE. Including communication costs, EC2 speedups range
from 5.1x to 71x for PCRE, and between 2.1x and 51.3x for PROSITE protein patterns. Speedups of
our C-based DFA matcher over the Perl-based ScanProsite scan tool [3D] range from 410.8x to 7781.3x
on a 40-core MTL node.
Algorithm 1: Sequential DFA matching

Input : transition function δ, input string Str, start state q0, set of final states F
Output: true if input is matched, false otherwise

1 state ← q0
2 for i ← 0 to |Str| − 1 do
3     state ← δ(state, Str[i])
4 if state ∈ F then
5     return true        // input matched
6 return false



1 Introduction 

Locating a string within a larger text has applications in text editing, compiler front-ends,
web browsers, scripting languages, file-search (grep), command-processors, databases, internet search
engines, computer security, and DNA sequence analysis. Regular expressions allow the specification 
of a potentially infinite set of strings (or patterns) to search for. A standard technique to perform 
regular expression matching is to convert a regular expression to a DFA and run the DFA on the input 
text. DFA-based regular expression matching has robust, linear performance in the size of the input. 
However, practical DFA implementations are inherently sequential as the matching result of an input 
character depends on the matching result of the previous characters. To speed up DFA matching
on parallel architectures, considerable research effort has already been spent [30, 47, 18, 23, 29, 41, 32, 17, ...].

To speed up DFA matching on parallel architectures, we propose to use speculation. With our 
method, the input string is divided into chunks. Chunks are processed in parallel using sequential DFA 
matching. For all but the first chunk, the starting state is unknown. The core insight of our method is 
to exploit structural properties of DFAs to bound the set of initial states the DFA may assume at the 
beginning of each chunk. Each chunk will be matched for its reduced set of possible initial states. By 
introducing such a limited amount of redundant matching computation for all but the first chunk, our 
DFA matching algorithm avoids speed-downs altogether (i.e., the speculation is failure-free [31]). To
achieve load-balancing, the input string is partitioned non-uniformly according to processor capacity
and work to be performed for each chunk. These properties open up the opportunity for an entirely new
class of parallel DFA matching algorithms. We present the time complexity of our matching algorithms,
and we conduct an extensive experimental evaluation on SIMD, shared-memory multicore and cloud
computing environments. For experiments, we employ regular expressions from the PCRE library [35]
and from the PROSITE protein pattern database [36].

The paper is organized as follows. In Section 2 we introduce background material. In Section 3
we discuss a motivating example for our speculative DFA matching algorithms. In Section 4 we
introduce our algorithms and their complexity with respect to speedup and costs. Section 5 shows three
implementations for SIMD, shared-memory multicore and cloud-computing environments. Section 6
contains experimental results. We discuss the related work in Section 7 and draw our conclusions in
Section 8.



2 Background 

2.1 Finite Automata 

Let $\Sigma$ denote a finite alphabet of characters and $\Sigma^*$ denote the set of all strings over $\Sigma$. Cardinality $|\Sigma|$
denotes the number of characters in $\Sigma$. A language over $\Sigma$ is any subset of $\Sigma^*$. The symbol $\emptyset$ denotes
the empty language and the symbol $\lambda$ denotes the null string. A finite automaton $A$ is specified by a
tuple $(Q, \Sigma, \delta, q_0, F)$, where $Q$ is a finite set of states, $\Sigma$ is an input alphabet, $\delta : Q \times \Sigma \to 2^Q$ is a
transition function, $q_0 \in Q$ is the start state and $F \subseteq Q$ is a set of final states. We define $A$ to be a DFA
if $\delta$ is a transition function of $Q \times \Sigma \to Q$ and $\delta(q, a)$ is a singleton set for any $q \in Q$ and $a \in \Sigma$. Let
$|Q|$ be the number of states in $Q$. We extend transition function $\delta$ to $\delta^*$: $\delta^*(q, ua) = p \iff \delta^*(q, u) = q'$,
$\delta(q', a) = p$, $a \in \Sigma$, $u \in \Sigma^*$. We assume that a DFA has a unique error (or sink) state $q_e$.

Fig. 1 Example DFA including the error state $q_e$ (a) and 12-symbol input string (b)

An input string Str over $\Sigma$ is accepted by DFA $A$ if the DFA contains a labeled path from $q_0$ to a
final state such that this path reads Str. We call this path an accepting path. Then, the language $L(A)$
of $A$ is the set of all strings spelled out by accepting paths in $A$.

The DFA membership test determines whether a string is contained in the language of a DFA. The
DFA membership test is conducted by computing $\delta^*(q_0, Str)$ and checking whether the result is a final
state. Algorithm 1 denotes the sequential DFA matching algorithm. As a notational convention, we
denote the symbol in the $i$-th position of the input string by Str[i]. For a comprehensive background
on automata theory we refer to [T§1H5] . 
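
As an illustration of the membership test, the following Python sketch mirrors the structure of Algorithm 1. It is not the C-based matcher evaluated in this paper. The dictionary encoding of the transition function, the state names, and the helper names `step` and `dfa_accepts` are assumptions made for this example; the DFA is the $a^*bc^*$ automaton described for Figure 1, with every undefined transition leading to the error state.

```python
# Sketch only: dictionary-encoded DFA for a*bc* (assumed encoding, cf. Fig. 1).
ERR = "qe"  # error (sink) state
DELTA = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "c"): "q1",
}


def step(state, symbol):
    """Transition function delta; undefined transitions lead to the error state."""
    return DELTA.get((state, symbol), ERR)


def dfa_accepts(text, start="q0", finals=frozenset({"q1"})):
    """Sequential membership test: run the DFA over text, then check for a final state."""
    state = start
    for symbol in text:
        state = step(state, symbol)
    return state in finals
```

For instance, `dfa_accepts("aaaaaaabcccc")` holds for the 12-symbol input of the motivating example, whereas `dfa_accepts("aaaa")` does not, because the single required `b` is missing.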



2.2 Amazon EC2 Infrastructure 

The Amazon Elastic Computing Cloud (EC2) allows users to rent virtual computing nodes on which 
to run applications. EC2 is very popular among researchers and companies in need of instant and 
scalable computing power. Amazon EC2 provides resizable compute capacity where users only pay 
for the capacity that their applications actually require. Amazon EC2 virtual computing nodes are 
Linux-based virtual machines running on top of the Xen hypervisor. By using virtualized resources, 
a computing cloud can serve a much broader user base with the same set of physical resources. EC2 
virtual machines are called instances. To provide a unit of measure for the compute capacities of 
instances, Amazon introduced so-called EC2 Compute Units (CUs), which are claimed to provide the 
equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor [3]. Because there
exist many such CPU models in the market, the exact processor capacity equivalent to one CU is
not entirely clear. Instance types are grouped into seven families, which differ in their processor, I/O,
memory and network capacities. Instances are described in [3]; the instances employed in this paper
are outlined in Section 5. To create a cluster of EC2 instances, the user requests the launch of one or
more instances, for which the instance type and the VM image must be specified. The user can specify
any VM image that has been registered with Amazon, including Amazon's or the user's own images. 
Once instances are booted, they are accessible as computing nodes via ssh. A maximum of 20 instances 
can be used concurrently. 



3 Overview 



The core idea behind our speculative DFA matching method is to divide the input into several chunks
and process chunks in parallel. As a motivating example we consider the DFA depicted in Figure 1.
This DFA accepts strings which contain zero or more occurrences of the symbol a, followed by exactly
one occurrence of symbol b, followed by zero or more occurrences of symbol c. For the exposition of
this motivating example we have included the DFA's error state $q_e$ and its adjacent transitions, which
are depicted in gray. The DFA's alphabet is $\Sigma = \{a, b, c\}$, and we consider the 12-symbol input string
from Figure 1(b).

Assuming that it takes on the order of one time-unit to process one character from the input
string, Algorithm 1 will spend 12 time units for the sequential membership test. This is denoted by
the following notation, where a processor $p_0$ matches the input string from Figure 1(b). The DFA is in
state $q_0$ initially.

    aaaaaaabcccc
    p0: q0

To parallelize the membership test for three processors, the input string can be partitioned into
three chunks of four symbols each, and assigned to processors $p_0$, $p_1$ and $p_2$ as follows.

       c0            c1             c2
      aaaa          aaab           cccc            (1)
    p0: q0      p1: q0,q1      p2: q0,q1



Because the DFA will initially be in start state $q_0$, the first chunk ($c_0$) needs to be matched for $q_0$ only.
For all subsequent chunks, the DFA state at the beginning of the chunk is initially unknown. Hence,
we use speculative computations to match subsequent chunks for all states the DFA may assume. We
will discuss in Section 4 how the amount of speculative computations can be kept to a minimum.
For our motivating example, we assume the DFA to be in either state $q_0$ or $q_1$ at the beginning of
chunks $c_1$ and $c_2$. As depicted by the partition from Eq. (1), processor $p_0$ will match chunk $c_0$ for
state $q_0$, whereas processors $p_1$ and $p_2$ will match their assigned chunks for both $q_0$ and $q_1$. To match
a chunk for a given state, a variation of the matching loop (lines 1-3) of Algorithm 1 is employed.

After processors $p_0$, $p_1$ and $p_2$ have processed their assigned chunks in parallel, the results from
the individual chunks need to be combined to derive the overall result of the matching computation.
Combining proceeds from the first to the last chunk by propagating the resulting DFA state from the
previous chunk as the initial state for the following chunk. According to Figure 1, the DFA from our
motivating example will be in state $q_0$ after matching chunk $c_0$. State $q_0$ is propagated as the initial
state for chunk $c_1$. Processor $p_1$ has matched chunk $c_1$ for both possible initial states, i.e., $q_0$ and $q_1$,
from which we obtain that state $q_0$ at the beginning of chunk $c_1$ takes the DFA to state $q_1$ at the end
of chunk $c_1$. Likewise, the matching result for chunk $c_2$ is now applied to derive state $q_1$ as the final
DFA state.

To compute the speedup over sequential DFA matching, we note that processor $p_0$ processes 4 input
characters, whereas processors $p_1$ and $p_2$ match the assigned chunks twice, for a total of 8 characters
per processor. The resulting speedup is thus $\frac{12}{8}$ or 1.5. (Combining the matching results will induce
slight additional costs on the order of the number of chunks, as we will consider in Section 4.)
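
The matching and combining steps of this example can be traced with a short Python sketch. It is an illustration only: the dictionary encoding of the DFA and the helper names `run`, `match_chunk` and `combine` are assumptions of this sketch, and the chunks are processed in an ordinary loop rather than on separate processors.

```python
# Sketch only: assumed dictionary encoding of the a*bc* DFA of the example.
ERR = "qe"
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "c"): "q1"}


def run(state, chunk):
    """Match one chunk from a single initial state."""
    for symbol in chunk:
        state = DELTA.get((state, symbol), ERR)
    return state


def match_chunk(chunk, initial_states):
    """Map each speculated initial state to the state reached after the chunk."""
    return {q: run(q, chunk) for q in initial_states}


def combine(results, start="q0"):
    """Propagate the DFA state from the first to the last chunk."""
    state = start
    for mapping in results:
        if state == ERR:  # the error state is a sink; nothing left to look up
            break
        state = mapping[state]
    return state


chunks = ["aaaa", "aaab", "cccc"]
speculated = [["q0"], ["q0", "q1"], ["q0", "q1"]]
results = [match_chunk(c, s) for c, s in zip(chunks, speculated)]
final_state = combine(results)  # "q1"

# Characters processed per processor, and the resulting speedup over 12 sequential steps.
work = [len(c) * len(s) for c, s in zip(chunks, speculated)]  # [4, 8, 8]
speedup = sum(len(c) for c in chunks) / max(work)  # 1.5
```

The per-processor work of 4, 8 and 8 characters reproduces the speedup of 1.5 stated above; the cost of `combine` grows with the number of chunks only.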



       c0            c1             c2
     aaaaaa          abc            ccc            (2)
    p0: q0      p1: q0,q1      p2: q0,q1

An input partition that accounts for the work imbalance between the initial and all subsequent
chunks is depicted in Eq. (2). Because processors $p_1$ and $p_2$ match chunks for two states each, their
chunks are only half the size of the chunk assigned to processor $p_0$. All processors now process 6
characters each, resulting in a balanced load and a 2x speedup over sequential matching.



       c0            c1             c2
      aaaa          aaab           cccc            (3)
    p0: q0        p1: q0         p2: q1



By considering the structure of DFAs, the amount of redundant, speculative computation can be
reduced. For the DFA in Figure 1 we observe that for each alphabet character $x \in \Sigma = \{a, b, c\}$, there
is only one DFA state (except the error state $q_e$) with an incoming transition labeled $x$. Thus, this
particular DFA has the structural property that for any given character $x \in \Sigma$, the DFA state after
matching character $x$ is known a priori to be either the error state or the state with the incoming
transition labeled $x$.

A processor can exploit this structural DFA property by performing a reverse lookahead to
determine the last character from the previous chunk. From this character the DFA state at the beginning
of the current chunk can be derived. In Eq. (3), the reverse lookahead for our motivating example is
shown. Reverse lookahead characters are shaded in gray. Character a is the lookahead character in
chunk $c_0$; only DFA state $q_0$ from Figure 1 has an incoming transition labeled a, thus the DFA must
be in state $q_0$ at the beginning of chunk $c_1$. Likewise, the DFA must be in state $q_1$ at the beginning of
chunk $c_2$, because state $q_1$ is the only DFA state with an incoming transition labeled b (the lookahead
character of chunk $c_1$). Note that for these considerations the error state $q_e$ can be ignored, because
once a DFA has reached the error state, it will stay there (see, e.g., Figure 1). Thus, to compute the
DFA matching result it is immaterial to process the remaining input characters once the error state
has been reached.

Because now all processors have to match only a single state per chunk, the chunks are of equal 
size. For three processors, we achieve a speedup of 3x over sequential matching for the motivating 
example. 
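
The reverse lookahead for the motivating example can be sketched in a few lines of Python. This is an illustration under assumptions: the DFA is encoded as a dictionary, the error state is left implicit (it has no entry as a target), and the helper names `targets` and `initial_states` are hypothetical.

```python
# Sketch only: assumed dictionary encoding of the a*bc* DFA; the error state
# is implicit, so it never shows up as a transition target.
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "c"): "q1"}


def targets(symbol):
    """Non-error states that have an incoming transition labeled symbol."""
    return sorted({t for (_, s), t in DELTA.items() if s == symbol})


def initial_states(text, start_pos, start="q0"):
    """Candidate initial states of the chunk that begins at start_pos."""
    if start_pos == 0:
        return [start]  # the first chunk starts in the DFA's start state
    return targets(text[start_pos - 1])  # reverse lookahead by one character


text = "aaaaaaabcccc"
chunk_starts = [0, 4, 8]
candidates = [initial_states(text, p) for p in chunk_starts]
# candidates == [["q0"], ["q0"], ["q1"]]
```

Each chunk is left with a single candidate state, which is why the three equally sized chunks yield the 3x speedup of the example.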

It should be noted that in the general case the structure of DFAs will be less ideal, i.e., there will be
more than one state with incoming transitions labeled by a particular input character. Consequently,
each chunk will have to be matched for more than one DFA state. We will develop a measure for the
suitability of a DFA for this type of speculative parallelization in Section 4. Our analysis of the
time-complexity of this method shows that for $|P| > 1$, a speedup is achievable in general. This has been
confirmed by our experimental evaluations on SIMD, shared-memory multicore, and the Amazon
EC2 cloud-computing environments. We will discuss the trade-offs that come with multi-character
reverse lookahead, and we will incorporate inhomogeneous compute capacities of processors to resolve
load imbalances. This is essential to effectively utilize heterogeneous multicore architectures, and to
overcome the performance variability of nodes reported with cloud computing environments [42, 4].

4 Speculative DFA Matching 

Our speculative DFA matching approach is a general method, which allows a variety of algorithms 
that differ with respect to the underlying hardware platform and the incorporation of structural DFA 
properties. We start this section with the formalization of our basic speculative DFA matching example
from Section 3. We then present our approach to exploit structural DFA properties to speed up parallel,
speculative DFA matching. Section 5 contains variants tailored for SIMD, shared-memory multicores
and cloud computing environments.

4.1 Basic Speculative DFA Matching Algorithm 

Our parallel DFA membership test consists of the following four steps; the first step is only required 
on platforms with processors of inhomogeneous performance. 

1. Offline profiling to determine the DFA matching capacity of each participating processor, 

2. partitioning the input string into chunks such that the utilization of the parallel architecture is 
maximized, 

3. performing the matching process on chunks in parallel such that redundant computations are 
minimized, and 

4. merging partial results across chunks to derive the overall result of the matching computation. 

Offline Profiling: For environments with inhomogeneous compute capacities, our offline
profiling step determines the DFA matching capacities of all participating processors. This information is
required to partition work equally among processors and thus balance the load. With heterogeneous
multicore hardware architectures such as the Cell BE [11], offline profiling must be conducted only
once to determine the performance of all types of processor cores provided by the architecture. With
cloud computing environments such as the Amazon EC2 cloud [3], users only have limited control
over the allocation of cloud computing nodes. However, the performance of cloud computing nodes has
been found to differ significantly, which is to a large extent attributed to variations in the employed
hardware platforms [42, 4]. To compensate for the performance variations between cloud computing
nodes, offline profiling will be conducted at cluster startup time. Profiling cluster nodes in parallel takes
only on the order of milliseconds, which makes the overhead from profiling negligible compared to the
substantial cluster startup times (on the order of minutes by our own experience and also reported
in [31]) on EC2.

Fig. 2 Example DFA (a) and input string with 36 symbols (b): Str = bababbababbababbaaabbababbbaabbaaaba

    Processor    m_k    w_k     L_k · w_k    Input character range
    p0           50     1.5     28.8         0-27
    p1           25     0.75    3.6          28-31
    p2           25     0.75    3.6          32-35

Table 1 Computation of chunk sizes for Figure 2 and three processors of non-uniform processing capacities.

To account for performance variations, we introduce a weight factor $w_k$, which denotes the processor
capacity of a processor $p_k$, normalized by the average processor capacity of the system. On each
processor $p_k$, our profiler performs several partial sequential DFA matching runs for a predetermined
number of input symbols on a given benchmark DFA. From the median of the obtained execution times,
we compute the number of symbols $m_k$ matched by processor $p_k$ per microsecond. The processor's
weight factor $w_k$ is then computed as

$$ w_k = m_k \cdot \left( \frac{1}{|P|} \sum_{0 \le i < |P|} m_i \right)^{-1}. \tag{4} $$

Columns "$m_k$" and "$w_k$" of Table 1 contain example matching capacities and corresponding weights
for a system of three processors. We will apply processor weights to partition the input string into
chunks as follows.
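
The normalization of matching capacities into weights can be checked with a small Python sketch. It is an illustration only; the function name `weights` is hypothetical, and exact rational arithmetic is used merely to avoid rounding noise in the example.

```python
from fractions import Fraction


def weights(rates):
    """Weight of each processor: its matching rate divided by the mean rate."""
    mean = Fraction(sum(rates), len(rates))
    return [Fraction(m) / mean for m in rates]


# Matching capacities of 50, 25 and 25 symbols per microsecond, as in Table 1.
w = weights([50, 25, 25])  # [3/2, 3/4, 3/4], i.e. 1.5, 0.75 and 0.75
```

By construction the weights average to one, so they always sum to the number of processors.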

Input Partitioning: We observed already with our motivating example from Eq. (1) that
partitioning the input into equal-sized chunks will result in load-imbalance: because for the first chunk the
initial DFA state is known to be $q_0$, the first chunk needs to be matched only once. All other chunks
must be matched for all possible initial states of the chunk, i.e., $|Q|$ times, in the worst case. In what
follows, we will derive a partition of the input Str into $|P|$ chunks, assuming that all except the first
chunk need to be matched for $|Q|$ states. In Section 4.2 we will exploit structural DFA properties to
reduce the number of states to be matched per chunk.

Intuitively, because processor $p_0$ has to match chunk $c_0$ only once, it can process a larger portion
of the input Str than the processors assigned to subsequent chunks. (This was observed already with
Eq. (2), where chunk sizes were adjusted such that all processors processed the same number of
characters from the input.) The objective of our optimization is to determine chunk sizes in such a
way that the processing times for all chunks are equal. The purpose of the following equations is to
compute a partition of the input into chunks $c_i$, $0 \le i < |P|$, where chunk $c_i$ is a sequence of symbols
from the input allocated to processor $p_i$.

Let $L_i$ denote the length of chunk $c_i$, where $0 \le i < |P|$, and let $n$ be the length of the input Str. Let us
further assume that matching of a character from the input takes constant time. Processor $p_0$ matches
chunk $c_0$ from starting state $q_0$. All other chunks need to be matched for all possible initial states. To
keep work among processors balanced, chunk $c_0$ must be $|Q|$ times longer than the other chunks, i.e.,
it must hold that

$$ L_i = \frac{L_0}{|Q|} \quad \text{for } 1 \le i < |P|. \tag{5} $$



The lengths of all chunks must add up to $n$, namely

$$ \sum_{0 \le i < |P|} L_i = n. \tag{6} $$



If processors have non-uniform processing capacity, we incorporate weight factors from Eq. (4), such
that weighted chunk sizes must add up to $n$.

$$ \sum_{0 \le i < |P|} L_i \cdot w_i = n. \tag{7} $$



Finally, we solve for the unknown $L_0$ by substituting corresponding parts of Eqs. (5) and (4) in Eq. (7),
i.e.,

$$ L_0 = \frac{n \cdot |Q|}{w_0 \cdot |Q| + \sum_{1 \le i < |P|} w_i}. \tag{8} $$



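The start and end positions for each chunk $c_k$, $0 \le k < |P|$, are computed by the following equations.
(Note that for $k = 1$, the range of the sum over $L_0 w_i$ in Eq. (9) is empty.)

$$ \mathit{StartPos}(c_k) = \begin{cases} 0, & \text{for } k = 0, \\ \left\lfloor L_0 w_0 + \frac{1}{|Q|} \sum_{1 \le i < k} L_0 w_i \right\rfloor, & \text{otherwise} \end{cases} \tag{9} $$

$$ \mathit{EndPos}(c_k) = \begin{cases} n - 1, & \text{for } k = |P| - 1, \\ \left\lfloor L_0 w_0 + \frac{1}{|Q|} \sum_{1 \le i \le k} L_0 w_i \right\rfloor - 1, & \text{otherwise} \end{cases} \tag{10} $$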



An example DFA and an input string of length $n = 36$ are presented in Figure 2. The corresponding
chunk sizes for three processors with different processing capacities are depicted in Table 1. We observe
by Eq. (8) that the length $L_0$ of chunk $c_0$ is 19.2 characters, and the weighted length according to
processor weight $w_0$ is 28.8 characters. From Eq. (5) we observe that the remaining chunks are four
times shorter than chunk $c_0$, because they have to be matched for $|Q| = 4$ states. The weighted lengths
of chunks $c_1$ and $c_2$ are thus 3.6 characters each. The rightmost column of Table 1 depicts the character
ranges of the input as they have been assigned to each chunk.
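
The chunk boundaries of this example can be reproduced with the following Python sketch of the partitioning equations. It is an illustration rather than the implementation used in our experiments; the function name `partition` and its argument layout are assumptions, and exact fractions are used so that the floor operations are not affected by floating-point rounding.

```python
from fractions import Fraction
from math import floor


def partition(n, num_states, w):
    """Return L0 and the (start, end) input positions of every chunk."""
    # Length of the first chunk for n symbols, |Q| states and weights w.
    l0 = Fraction(n * num_states) / (w[0] * num_states + sum(w[1:]))

    def offset(k):
        # Weighted length of chunk c0 plus the weighted lengths of c1 .. c(k-1).
        return l0 * w[0] + sum((l0 * wi for wi in w[1:k]), Fraction(0)) / num_states

    last = len(w) - 1
    bounds = []
    for k in range(len(w)):
        start = 0 if k == 0 else floor(offset(k))
        end = n - 1 if k == last else floor(offset(k + 1)) - 1
        bounds.append((start, end))
    return l0, bounds


w = [Fraction(3, 2), Fraction(3, 4), Fraction(3, 4)]
l0, bounds = partition(36, 4, w)
# l0 == 96/5 (19.2 characters); bounds == [(0, 27), (28, 31), (32, 35)]
```

The computed ranges agree with the rightmost column of Table 1.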

Matching of Chunks: Algorithm 2 depicts our basic speculative DFA matching procedure. We
employ the notation introduced in [18] to denote a mapping of possible initial states to possible last
active states of a chunk. This mapping is required to store a chunk's matching results for all possible
initial states. After matching chunks in parallel, the computed mappings will be used to derive the
overall DFA matching result. Formally, this mapping is defined as a vector

$$ \mathcal{L}_i = [l_0, l_1, \ldots, l_{|Q|-1}], $$

where $0 \le i < |P|$ and $l_j \in Q$ for all $0 \le j < |Q|$. Let element $l_j$ of $\mathcal{L}_i$ denote the last active state,
assuming that processor $p_i$ starts in state $q_j$ and processes the DFA membership test on chunk $c_i$, i.e.,

$$ \delta^*(q_j, c_i) = l_j. $$

As an example, we consider chunk $c_2$ from Eq. (1) and the DFA from Figure 1. Chunk $c_2$ will be
matched for the possible initial states $q_0$ and $q_1$, with the resulting last active states $q_e$ and $q_1$ and
the result vector $\mathcal{L}_2 = [q_e, q_1]$. The meaning of vector $\mathcal{L}_2$ is that if the DFA assumes state $q_0$ at the
beginning of chunk $c_2$, then it will be in state $q_e$ after matching chunk $c_2$. If the DFA assumes state $q_1$
at the beginning of chunk $c_2$, then it will be in state $q_1$ after matching chunk $c_2$.

Our basic speculative DFA matching procedure employs Eqs. (9) and (10) to derive the start and
end position of each chunk (lines 4-5 of Algorithm 2). The algorithm distinguishes between the first
input chunk (lines 6-8) and all subsequent chunks (lines 9-12). According to our partitioning scheme,
chunk $c_0$ is only matched for the start state $q_0$ (lines 7-8). For all subsequent chunks $c_i$, all possible DFA
states are matched and stored in vector $\mathcal{L}_i$. Chunk sizes are chosen according to processor weights and
the number of states to be matched with each chunk. The goal of this partitioning is to load-balance
the DFA matching to effectively utilize the underlying parallel hardware platform. We will discuss
in a later section that our partitioning scheme makes this speculation failure-free. The output of Algorithm 2
is the set of vectors $\mathcal{L}_i$, where each vector describes the possible last states according to the possible
initial states of a given chunk.



Algorithm 2: Basic speculative DFA matching

Input : δ, Q, Σ, P, Str = c0 c1 ... c|P|−1
Output: vector L_i for each chunk c_i

 1 for i ← 0 to |P| − 1 do in parallel
 2     for j ← 0 to |Q| − 1 do
 3         L_i[j] ← j                          // initialize vector L_i
 4     Start ← StartPos(c_i)
 5     End ← EndPos(c_i)
 6     if i = 0 then                           // chunk c0
 7         for k ← Start to End do
 8             L_0[0] ← δ(L_0[0], Str[k])
 9     else                                    // chunks c1 ... c|P|−1
10         foreach j ∈ Q do
11             for k ← Start to End do
12                 L_i[j] ← δ(L_i[j], Str[k])
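
A compact Python rendering of Algorithm 2 is sketched below for illustration. It is not the C implementation used in our evaluation. States are encoded as integers (0 for $q_0$, 1 for $q_1$, 2 for the error state of the $a^*bc^*$ DFA from Figure 1), the chunks are passed in pre-split instead of being derived from StartPos and EndPos, and a thread pool merely stands in for the parallel loop; all of these are assumptions of the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch only: table encoding of the a*bc* DFA; state 2 is the error (sink) state.
DELTA = [
    {"a": 0, "b": 1, "c": 2},  # q0
    {"a": 2, "b": 2, "c": 1},  # q1
    {"a": 2, "b": 2, "c": 2},  # error state
]


def match_chunk(indexed_chunk):
    """Compute the vector L_i for one chunk (body of the parallel loop)."""
    i, chunk = indexed_chunk
    vec = list(range(len(DELTA)))  # L_i[j] <- j
    # Chunk c0 is matched for the start state only, all others for every state.
    states = [0] if i == 0 else range(len(DELTA))
    for j in states:
        for symbol in chunk:
            vec[j] = DELTA[vec[j]][symbol]
    return vec


def speculative_match(chunks):
    """Match all chunks concurrently and return one L-vector per chunk."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(match_chunk, enumerate(chunks)))


vectors = speculative_match(["aaaaaa", "abc", "ccc"])
# vectors == [[0, 1, 2], [1, 2, 2], [2, 1, 2]]
```

In CPython the thread pool does not give a real speedup for this CPU-bound loop; it only makes the independence of the per-chunk computations explicit.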



Merging of Partial Results: After matching chunks in parallel, each processor $p_i$ has constructed
a mapping $\mathcal{L}_i$ of possible initial states to last active states. To finish the DFA run, the partial results
computed for chunks $c_i$ need to be combined to determine the last active state for the DFA-run over
the whole input string Str $= c_0 c_1 \ldots c_{|P|-1}$. Chunk $c_0$ is the only chunk for which we know the initial
state of the automaton, i.e., $q_0$. We use this information to apply the mappings $\mathcal{L}_i$ sequentially to
derive the last active state as follows (it should be noted that index 0 of the $\mathcal{L}_0[\cdot]$ mapping is the
index of the start state $q_0$):

$$ \text{last active state} = \mathcal{L}_{|P|-1}[\mathcal{L}_{|P|-2}[\cdots \mathcal{L}_0[0] \cdots]] \tag{11} $$



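It has been shown in [18] how a binary reduction (see [28]) can be used to parallelize this computation.
A binary reduction uses a combining operation on two maps $\mathcal{L}_i$ and $\mathcal{L}_j$ to derive the combined map $\mathcal{L}_{i,j}$
as depicted in Eq. (12).

$$ \mathcal{L}_{i,j} = \left[ \mathcal{L}_j[\mathcal{L}_i[0]],\ \mathcal{L}_j[\mathcal{L}_i[1]],\ \ldots,\ \mathcal{L}_j[\mathcal{L}_i[|Q|-1]] \right] \tag{12} $$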



The reduction step above can be performed repeatedly in parallel to combine maps until we finally
arrive at the map $\mathcal{L}_{0,|P|-1}$, which represents the overall effect of a DFA. In particular, the value $\mathcal{L}_{0,|P|-1}[0]$
will be the last active state of a DFA's run on the input Str.
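
Both merging variants can be sketched in a few lines of Python. The sketch is illustrative only: the function names are hypothetical, and the three example vectors are assumed values for the $a^*bc^*$ DFA with integer-encoded states (0 for $q_0$, 1 for $q_1$, 2 for the error state) and the chunks aaaaaa, abc and ccc.

```python
from functools import reduce


def merge_sequential(vectors, start=0):
    """Apply the maps one after another, beginning with the start state."""
    state = start
    for vec in vectors:
        state = vec[state]
    return state


def combine(vec_i, vec_j):
    """Compose two maps: entry q of the result is vec_j[vec_i[q]]."""
    return [vec_j[q] for q in vec_i]


# Assumed example vectors (one per chunk); see the lead-in for the encoding.
vectors = [[0, 1, 2], [1, 2, 2], [2, 1, 2]]

last_sequential = merge_sequential(vectors)  # 1, i.e. state q1
overall_map = reduce(combine, vectors)       # [1, 2, 2]
last_reduced = overall_map[0]                # 1 as well
```

Here `reduce` folds the maps from left to right. Because composing maps is associative, the same `combine` operation could equally be applied pairwise in a binary tree, which is what a parallel reduction would do.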

The work in [18] does not provide an evaluation of the relative merits of sequential vs. parallel
merging of $\mathcal{L}$-vectors. In particular, the details of the employed parallel reduction algorithm are not
specified. We conducted experiments on a 40-core shared memory node of the Intel MTL using a
binary tree for the parallel reduction to find that the computation associated with the merging of
$\mathcal{L}$-vectors is too little to justify the overhead of a parallel reduction. Especially the overhead from the
synchronization required between each of the $O(\log_2(|P|))$ reduction steps is costly.
Moreover, the overhead becomes significant if communication costs between nodes are introduced,
such as with cloud computers. We describe our findings on the overheads of intra-node and inter-node
communication with the EC2 computing cloud in detail in Section 6. Section 5 introduces a new
$\mathcal{L}$-vector merging technique to cope with the overhead on cloud computers.

In short, we applied the sequential merging from Eq. (11) with shared-memory multicore
architectures and a new hierarchical merging technique for cloud computing architectures, which will be
explained in Section 5.



4.2 Optimizations Based on Structural DFA Properties 

The amount of work associated with a given chunk is determined by (1) the length of the chunk, and
(2) the number of DFA states for which the chunk needs to be matched. In the following, we will
distinguish between the initial chunk $c_0$, and subsequent chunks $c_i$, $i > 0$. Before matching the initial
chunk $c_0$, the DFA will be in the starting state $q_0$, thus chunk $c_0$ only needs to be matched for $q_0$.
Prior to the matching of subsequent chunks, the DFA may assume any state in the general case, thus
subsequent chunks need to be matched $|Q|$ times (see, e.g., the motivating example in Eq. (1)). In this
section we will exploit structural properties of DFAs to deduce a potentially smaller number $\mathcal{I}_{\max} \le |Q|$
of states which is the upper bound of initial states for all subsequent chunks.

The best case, i.e., $\mathcal{I}_{\max} = 1$, has already been observed with our motivating example DFA from
Figure 1. For each character $\sigma \in \Sigma$ of this DFA, it holds that there is only one state targeted by a
transition labeled $\sigma$. Irrespective of the particular input character $\sigma$, the DFA can only assume a single
state after matching character $\sigma$. (As mentioned previously, for these considerations we may safely
disregard the error state $q_e$, because from the error state no other state is reachable; thus, a DFA that
reached the error state will stay there.) If there is only one possible DFA state after matching an input
character, it follows that the DFA can only be in one state after matching the last character prior
to each subsequent chunk. Thus the DFA can only be in one possible state at the beginning of each
subsequent chunk, and we have $\mathcal{I}_{\max} = 1$.

In the general case, values for $\mathcal{I}_{\max}$ can range between 1 and $|Q|$. In the remainder of this section,
we will investigate how to deduce this $\mathcal{I}_{\max}$ value for a particular DFA, and how this information
can be incorporated with our speculative DFA matching algorithm. We will consider real-world DFAs
from PCRE and PROSITE to find that for all considered DFAs it holds that $\mathcal{I}_{\max} < |Q|$, and that
this property can be used to improve DFA matching performance. We have already observed with the
input partition of Eq. (3) that reducing the number of initial states of subsequent chunks enables us
to increase the sizes of subsequent chunks. Larger subsequent chunks will reduce the size of the initial
chunk $c_0$ in turn. Because we adjust chunk sizes such that all chunks will be processed in the same
amount of time, reducing the size of the initial chunk $c_0$ will reduce the overall execution time of the
matching process. The overarching reason for this performance improvement is that the reduction of
potential initial states reduces the total number of symbols that have to be matched per chunk.

This can be formalized as follows. Let $\mathcal{I}_{\max}$ denote the maximum number of possible initial states
that the DFA may assume at the start over all subsequent chunks. The number of possible initial states
can be different for each chunk, depending on the last character of the preceding chunk. We assume that
for $\mathcal{I}_{\max}$ we pick the maximum value out of all possible sets of initial states over all chunks. If
$\mathcal{I}_{\max} < |Q|$, then the length $L_0$ of chunk $c_0$ reduces by Eq. (8):

$$ \frac{n \cdot \mathcal{I}_{\max}}{w_0 \cdot \mathcal{I}_{\max} + \sum_{1 \le i < |P|} w_i} < \frac{n \cdot |Q|}{w_0 \cdot |Q| + \sum_{1 \le i < |P|} w_i} \tag{13} $$



To deduce the maximum value for $I_{max}$, we eliminate states that can never be the initial state
for a given chunk. For each character $\sigma \in \Sigma$, a DFA will contain a number of states that have an
incoming transition labeled $\sigma$. Thus, if the last character of a chunk's preceding chunk is $\sigma$, then only
the states with an incoming transition labeled $\sigma$ need to be matched. We call the last input character
of a chunk's preceding chunk the reverse lookahead symbol. The number of states to be matched for a
reverse lookahead symbol $\sigma \in \Sigma$ is a static property of a DFA. It will range between 1 and $|Q|$. The
maximum number of states to be matched over any reverse lookahead symbol constitutes an upper
bound on $I_{max}$, i.e., an upper bound on the number of states to be matched for any subsequent chunk.
Because $I_{max}$ is a static DFA property, we can use it to partition the input into chunks according to
Eq. (13). At run-time, a processor will use the reverse lookahead symbol to determine the initial states
to be matched for its assigned chunk.

Given lookahead symbol $\sigma$, we define the set of initial states $I_\sigma$ as the set of all states that have
an incoming transition labeled $\sigma$:

$$ I_\sigma = \{ s : \delta(x, \sigma) = s \}, \quad \forall s, x \in Q. \tag{14} $$

If symbol $\sigma$ is the reverse lookahead symbol of chunk $c_i$, then the set of possible initial states for
chunk $c_i$ is $I_\sigma$. We compute the set of possible initial states for all symbols from the DFA's alphabet $\Sigma$
and set $I_{max}$ to the maximum cardinality among those sets, i.e.,

$$ I_{max} = \max_{\sigma \in \Sigma} \left( |I_\sigma| \right). \tag{15} $$
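Eqs. (14) and (15) can be sketched in C as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names, the row-major table layout `delta[s * nsym + sym]`, and the 256-state limit are assumptions made for the example. The table is the one listed in Fig. 6(c), with state 4 acting as the error state.

```c
#include <assert.h>
#include <stddef.h>

/* |I_sym| from Eq. (14): number of distinct non-error states that have an
 * incoming transition labeled `sym`. Hypothetical helper; `delta` is a
 * row-major table of target state numbers, `err` is the error state. */
size_t initial_set_size(const int *delta, size_t nstates, size_t nsym,
                        size_t sym, int err)
{
    unsigned char seen[256] = {0};   /* assumption: at most 256 states */
    size_t count = 0;
    assert(nstates <= 256);
    for (size_t s = 0; s < nstates; s++) {
        int target = delta[s * nsym + sym];
        assert(target >= 0 && (size_t)target < nstates);
        if (target != err && !seen[target]) {
            seen[target] = 1;
            count++;
        }
    }
    return count;
}

/* I_max from Eq. (15): the largest |I_sym| over all symbols. */
size_t compute_imax(const int *delta, size_t nstates, size_t nsym, int err)
{
    size_t imax = 0;
    for (size_t sym = 0; sym < nsym; sym++) {
        size_t size = initial_set_size(delta, nstates, nsym, sym, err);
        if (size > imax)
            imax = size;
    }
    return imax;
}

/* Transition table as listed in Fig. 6(c): columns a, b; state 4 is q_e. */
const int example_delta[] = { 1, 2,  4, 3,  1, 3,  3, 4,  4, 4 };
```

For the example table, both symbols have two states with a matching incoming transition, so the sketch yields an $I_{max}$ of 2.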

As an example, consider the DFA from Figure 2(a). Figure 3 shows the input string partitioned for
three processors of equal capacity, i.e., $w_0 = w_1 = w_2 = 1$. The reverse lookahead symbols are depicted
in gray. No reverse lookahead is required for chunk $c_0$, which will be matched from the DFA's start
state $q_0$. Because the reverse lookahead symbol of chunk $c_1$ is an 'a', when matching of chunk $c_1$
begins the DFA can only be in a state that has an incoming transition labeled 'a'. Likewise, because the
reverse lookahead symbol of chunk $c_2$ is 'b', the DFA can only be in a state that has an incoming
transition labeled 'b' when matching of chunk $c_2$ begins. We get $I_a = \{q_1, q_3\}$, $I_b = \{q_2, q_3\}$, and
$I_{max} = 2$. Inserting $n = 36$, $I_{max} = 2$, $|Q| = 4$ and $w_0 = w_1 = w_2 = 1$ in Eq. (13) yields
$L_0 = 18 < 24$ and a speedup of $24/18 \approx 1.3$ over the non-optimized matching procedure.
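The arithmetic of this example can be checked with a few lines of C. The sketch below assumes the initial-chunk length has the form $L_0 = n \cdot m / (w_0 \cdot m + \sum_{1 \le i < |P|} w_i)$, with $m$ standing for either $|Q|$ or $I_{max}$; the function name and the use of floating-point weights are illustrative choices, not taken from the original implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the initial chunk c_0, assuming the form
 *   L_0 = n * m / (w[0] * m + sum_{i=1}^{nproc-1} w[i]),
 * where m is |Q| (non-optimized) or I_max (optimized).
 * Hypothetical helper for illustration only. */
double initial_chunk_length(double n, double m, const double *w, size_t nproc)
{
    double rest = 0.0;
    assert(nproc >= 1);
    for (size_t i = 1; i < nproc; i++)
        rest += w[i];               /* weights of the subsequent chunks */
    return n * m / (w[0] * m + rest);
}
```

With $n = 36$ and three equal weights, the function returns 24 for $m = |Q| = 4$ and 18 for $m = I_{max} = 2$, matching the values in the text.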



Str:  c_0: bababbababbababbaa    c_1: abbababbb    c_2: aabbaaaba
      P_0: q_0                   P_1: q_1, q_3     P_2: q_2, q_3

Fig. 3 Partitioned input string with reverse lookahead symbols and set of initial states to be matched for each chunk.



Algorithm 3 applies initial state sets with the DFA matching procedure. Lines 1-7 compute initial
state sets $I_\sigma$ from Eq. (14) and $I_{max}$ from Eq. (15). Unlike Algorithm 2, the partitioning is now based on
the maximum number of possible initial states, $I_{max}$, instead of $|Q|$. The StartPos and EndPos functions
that compute the start and end position of each chunk now receive $I_{max}$ as the second argument
(lines 11-12 in Algorithm 3). We updated the defining equations of StartPos and EndPos to take an
additional parameter for passing $I_{max}$. In Eqs. (5)-(10), instead of $|Q|$ we then use the provided
argument value to partition the input string and to compute the start and end position of each chunk.
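The per-chunk part of Algorithm 3 for a subsequent chunk can be sketched as below. The sketch assumes a row-major table of state numbers, a half-open chunk range `[start, end)`, and that the caller has already selected the initial state set from the reverse lookahead symbol; all identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Match one subsequent chunk str[start..end) once for every state in the
 * initial state set chosen by the reverse lookahead symbol. lvec[j] ends up
 * holding the state reached when starting in state j; entries for states
 * outside the set keep their identity value and are never consulted. */
void match_chunk_for_set(const int *delta, size_t nstates, size_t nsym,
                         const int *str, size_t start, size_t end,
                         const int *init_set, size_t set_size, int *lvec)
{
    for (size_t j = 0; j < nstates; j++)
        lvec[j] = (int)j;                        /* initialize the vector */
    for (size_t idx = 0; idx < set_size; idx++) {
        int j = init_set[idx];
        assert(j >= 0 && (size_t)j < nstates);
        for (size_t k = start; k < end; k++)     /* one pass per initial state */
            lvec[j] = delta[(size_t)lvec[j] * nsym + (size_t)str[k]];
    }
}
```

The work per chunk is proportional to the size of the initial state set times the chunk length, which is why a smaller $I_{max}$ allows longer subsequent chunks.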

Because the maximum number of initial states $I_{max}$ is a static property of a DFA, it can be
computed off-line. The overhead to compute $I_{max}$ can thus be avoided for DFAs that are matched multiple
times. For example, with protein patterns maintained in databases, the corresponding DFAs can be
expected to be matched on several DNA sequences. However, in all our experiments, we computed $I_{max}$
online for every matching run (as stated in Algorithm 3), to account for the general case where a DFA is
matched only once.

Another possible optimization of Algorithm 3 concerns the distribution of cardinalities of initial
state sets $I_\sigma$. If the maximum value $I_{max}$ is significantly larger than the average, then it is desirable to
divide the input at boundaries with reverse lookahead symbols that have a small initial state set. This
would further decrease the number of possible initial states of subsequent chunks. However, searching
the input for the occurrence of particular characters constitutes an effort similar to the matching
process itself. Moreover, relying on statistical properties of the input string (i.e., the occurrence of
particular characters in the input) may violate the failure-freedom of our speculation: if a reverse
lookahead symbol with a small set of initial states cannot be found, then additional states need to be
matched, resulting in a possible speed-down. In contrast, by considering $I_{max}$ states, our optimization
always shows equal or better performance than the non-optimized matching procedure that has to
match all states in $Q$.



4.3 Multiple Reverse Lookahead Symbols 

As discussed in the previous section, a smaller $I_{max}$ will decrease the number of symbols to be
matched per chunk, thereby increasing DFA matching performance. We can potentially decrease the
number of possible initial states if we employ additional reverse lookahead symbols with each chunk.
Consider a string of reverse lookahead symbols $\sigma_1 \ldots \sigma_k$, $k \ge 1$. We number the reverse lookahead
symbols in the order they are matched by the DFA, which is the reverse order of the lookahead itself.
The set of initial states $I_{\sigma_1 \ldots \sigma_k}$ constitutes the set of all states that are the target of a path through
the DFA labeled by a string with postfix $\sigma_1 \ldots \sigma_k$, i.e.,

$$ I_{\sigma_1 \ldots \sigma_k} = \{ s : \delta^*(x, \sigma_1 \ldots \sigma_k) = s \}, \quad \forall s, x \in Q. \tag{16} $$

Let $I_{max,r}$ be the maximum number of possible initial states when using $r$ reverse lookahead symbols (in
particular, $I_{max,1} = I_{max}$). Algorithm 4 shows for a reverse lookahead of 2 characters how to compute
initial state sets $I_{\sigma_1\sigma_2}$ and constant $I_{max,2}$. As evident from this example, the time complexity for
computing $I_{max,r}$ is $O(|\Sigma|^r \cdot |Q| + |Q|)$, i.e., the algorithm is exponential in the number $r$ of reverse
lookahead symbols.
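The two-symbol case can be sketched in C along the lines of Algorithm 4. The sketch assumes a row-major table of state numbers in which the error state has its own row of self-loops, so the second lookup is always in bounds; the function name and the 256-state limit are assumptions, and the table is the one listed in Fig. 6(c).

```c
#include <assert.h>
#include <stddef.h>

/* I_{max,2}: for every symbol pair (s1, s2), collect the non-error states
 * reached by delta(delta(q, s1), s2) over all q, and keep the largest set
 * size. Runs in O(|Sigma|^2 * |Q|) steps. Hypothetical helper. */
size_t compute_imax2(const int *delta, size_t nstates, size_t nsym, int err)
{
    size_t imax2 = 0;
    assert(nstates <= 256);          /* assumption: at most 256 states */
    for (size_t s1 = 0; s1 < nsym; s1++) {
        for (size_t s2 = 0; s2 < nsym; s2++) {
            unsigned char seen[256] = {0};
            size_t count = 0;
            for (size_t q = 0; q < nstates; q++) {
                int via = delta[q * nsym + s1];
                assert(via >= 0 && (size_t)via < nstates);
                int target = delta[(size_t)via * nsym + s2];
                assert(target >= 0 && (size_t)target < nstates);
                if (target != err && !seen[target]) {
                    seen[target] = 1;
                    count++;
                }
            }
            if (count > imax2)
                imax2 = count;
        }
    }
    return imax2;
}

/* Transition table as listed in Fig. 6(c): columns a, b; state 4 is q_e. */
const int example_delta[] = { 1, 2,  4, 3,  1, 3,  3, 4,  4, 4 };
```

For this small table the pair "ba" still admits two initial states, so the second lookahead symbol does not lower the maximum; larger DFAs may behave differently.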

The following lemma establishes that when increasing the number of reverse lookahead symbols,
the maximum number of possible initial states $I_{max,r}$ of a DFA is bounded above by $I_{max}$.

Lemma 1. Given a DFA, it holds that $I_{max} = I_{max,1} \ge I_{max,2} \ge \cdots \ge I_{max,\omega}$, where $\omega$ denotes the
length of the longest accepting path through the DFA.



Algorithm 3: DFA matching applying initial state sets

Input : δ, Q, Σ, P, Str = c_0 c_1 ... c_{|P|-1}, q_0
Output: vector L_i for each chunk c_i

 1  foreach σ_i ∈ Σ do
 2      I_σi ← ∅
 3      foreach s ∈ Q do
 4          q_target ← δ(s, σ_i)
 5          if q_target ≠ q_e then
 6              I_σi ← I_σi ∪ {q_target}
 7  I_max ← max(|I_σ0|, ..., |I_σ(|Σ|-1)|)
 8  for i ← 0 to |P| − 1 do in parallel
 9      for j ← 0 to |Q| − 1 do                  // initialize vector L_i
10          L_i[j] ← j
11      Start ← StartPos(c_i, I_max)
12      End ← EndPos(c_i, I_max)
13      if i = 0 then                            // chunk c_0
14          for k ← Start to End do
15              L_0[0] ← δ(L_0[0], Str[k])
16      else                                     // chunks c_1 ... c_{|P|-1}
17          foreach j ∈ I_ci do
18              for k ← Start to End do
19                  L_i[j] ← δ(L_i[j], Str[k])


Algorithm 4: Initial state set I_σ1σ2 and I_max,2 computation for 2-character reverse lookahead

Input : δ, Q, Σ
Output: I_σ1σ2, I_max,2

 1  foreach σ_1 ∈ Σ do
 2      foreach σ_2 ∈ Σ do
 3          I_σ1σ2 ← ∅
 4          foreach q ∈ Q do
 5              I_σ1σ2 ← I_σ1σ2 ∪ ({δ(δ(q, σ_1), σ_2)} \ {q_e})
 6  I_max,2 ← max over σ_1, σ_2 ∈ Σ of |I_σ1σ2|



Proof. Indirect. WLOG we assume a DFA with exactly one of its transitions labeled by a symbol $\sigma \in \Sigma$,
and state $q$ being the target state of this transition. For this DFA, $|I_\sigma| = 1$. Given another symbol
$\sigma' \in \Sigma$, we assume that $|I_{\sigma'\sigma}| = 2$. Then by the definition of $I_{\sigma'\sigma}$ in Eq. (16), this DFA must have two
distinct states that are the target of a path labeled by a string with postfix $\sigma'\sigma$. However, this implies that
these two target states have an incoming transition labeled $\sigma$, which contradicts our initial assumption
that $|I_\sigma| = 1$. Thus for any two symbols $\sigma$ and $\sigma'$, it holds that $|I_\sigma| \ge |I_{\sigma'\sigma}|$. The extension to the
general case $|I_{\sigma_1 \ldots \sigma_k}| \ge |I_{\sigma'\sigma_1 \ldots \sigma_k}|$ is straightforward and the lemma follows. □



[Figure 4: bar charts in two panels, (a) PCRE and (b) PROSITE. x-axis: number of original states; bars: Original, Symbols=1, Symbols=2, Symbols=3, Symbols=4.]

Fig. 4 Sizes of $|Q|$ and $I_{max,r}$ for various numbers of reverse lookahead symbols. The height of a bar represents the
absolute number of states in the corresponding set, i.e., $|Q|$, $I_{max,1}$, $I_{max,2}$, $I_{max,3}$ and $I_{max,4}$.



We investigated the sizes of possible initial state sets for the PCRE and PROSITE benchmark
suites for 1, 2, 3 and 4 reverse lookahead symbols. Figure 4 depicts the number of states $|Q|$ and the
number of possible initial states for 299 PCRE benchmark DFAs and 110 PROSITE protein patterns.
(For DFAs with the same number of states, the possible initial state set sizes were averaged.) The
height of a bar in Figure 4 denotes the number of states in the corresponding set. For example,
the rightmost, largest DFA in Figure 4(b) consists of $|Q| = 766$ states. One-symbol reverse lookahead
reduces this to $I_{max,1} = 234$ states. For two-symbol, three-symbol and four-symbol lookahead, the possible
initial state sets reduce to 107, 57 and 56 states. The average size of possible initial state sets for
1, 2, 3 and 4 reverse lookahead symbols compared to the overall number of states $|Q|$ is given in
Table 2. Applying a reverse lookahead of one symbol to the PCRE benchmarks reduces the number of
possible initial states on average to 33.7% of the original states. Applying 2, 3 and 4 reverse lookahead
symbols yielded further reductions of 7%, 10% and 12% of $|Q|$ relative to one-symbol lookahead. With
the PROSITE benchmarks, one-symbol reverse lookahead reduced the number of possible initial states
on average to 47.2% of the original states. Applying 2, 3 and 4 reverse lookahead symbols yielded further
reductions of 18%, 26% and 31% of $|Q|$. The profitability of reverse lookahead is a static property of
DFAs, which is reflected in this data: while for PCRE one-symbol lookahead already yields a large
reduction in the number of states, lookahead of two or more symbols does not provide substantial
further improvement. With PROSITE, however, one-symbol reverse lookahead provided a smaller
improvement, while reverse lookahead of up to 4 symbols yielded steady gains.

r          0       1       2       3       4
PCRE       100%    33.7%   26.4%   23.7%   21.7%
PROSITE    100%    47.2%   29.2%   20.5%   16.0%

Table 2 Average size of $I_{max,r}$ compared to $|Q|$, for $r$ reverse lookahead symbols



Because of the exponential time complexity to compute $I_{max,r}$, there is a trade-off between the
overhead of the reverse lookahead computation and the obtainable performance gains. To quantify this
overhead, we investigated the cost of reverse lookahead computations on an Intel Xeon 5120 CPU.
Figure 5(a) shows the overhead in microseconds to compute $I_{max,r}$ for an example DFA of $|Q| = 5$
for up to three reverse lookahead characters. As expected from the $|\Sigma|^r$ factor in the complexity, the
overhead grows with the size of $\Sigma$ and exponentially with the number $r$ of lookahead symbols.
Figure 5(b) depicts the overhead for increasing numbers of states. Because $I_{max,r}$ is a static property
of a DFA, it can be computed off-line, and then loaded when the matching operation is performed.
This way the overhead can be avoided with DFAs that are matched many times (e.g., protein patterns
from databases). It should be noted that all our experiments in Section 6 include the overhead for
$I_{max,r}$ computations, which shows that especially for smaller numbers of reverse lookahead symbols the
overhead is tolerable.



[Figure 5: two log-scale plots with curves for symbols=1, symbols=2 and symbols=3. (a) $I_{max,r}$ calculation overhead for various numbers of reverse lookahead symbols per $|\Sigma|$ (x-axis: $|\Sigma|$); (b) $I_{max,r}$ calculation overhead for various numbers of reverse lookahead symbols per $|Q|$ (x-axis: number of original states).]

Fig. 5 Required overhead due to $I_{max,r}$ calculation



4.4 Time Complexity 

The time complexity of sequential DFA matching is $O(n)$, where $n$ is the length of the input string.
Our basic speculative DFA matching approach from Section 4.1 distinguishes the first chunk from
subsequent chunks to partition the input string such that the matching load is balanced. The time
required for parallel matching is on the order of

$$ O\!\left( \frac{n \cdot |Q|}{|Q| + |P| - 1} \right). \tag{17} $$

The speedup of Algorithm 2 over sequential matching is thus on the order of $O\!\left(1 + \frac{|P| - 1}{|Q|}\right)$. It follows that
in terms of algorithm complexity, this approach will not produce a speed-down, i.e., it is failure-free.
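As a quick check of this bound, dividing the sequential cost $n$ by the parallel cost of Eq. (17) gives the stated speedup. The numbers reuse the running example with $|Q| = 4$, $|P| = 3$ and equal weights, and are meant as an illustration only:

```latex
\[
\frac{n}{\dfrac{n \cdot |Q|}{|Q| + |P| - 1}}
  = \frac{|Q| + |P| - 1}{|Q|}
  = 1 + \frac{|P| - 1}{|Q|},
\qquad \text{e.g.} \quad
|Q| = 4,\ |P| = 3: \quad 1 + \tfrac{2}{4} = 1.5 = \tfrac{36}{24}.
\]
```

Since $|P| \ge 1$, the ratio is never below 1, which is the failure-freedom property at the level of the complexity analysis.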

Eq. (18) shows the time complexity of parallel DFA matching with reduced sets of potential initial
states from Section 4.2. Because computing $I_{max,r}$ constitutes overhead, we have an additional term
$O(|Q| \cdot |\Sigma|^r)$, where $r$ is the number of reverse lookahead symbols.

$$ O\!\left( |Q| \cdot |\Sigma|^r + \frac{n \cdot I_{max,r}}{I_{max,r} + |P| - 1} \right) \tag{18} $$

Name                           CPU Model                               CPUs   Cores per CPU   Clock Freq.   Note
Intel MTL                      Intel Xeon E7-4860                      4      10              2.27 GHz      n/a
SDE emulator on local server   AVX2/Haswell on Intel Xeon E5405 host   n/a    n/a             n/a           n/a
Amazon EC2 (m2.4xlarge)        Intel Xeon X5550                        2      4               2.67 GHz      26 EC2 CUs
Amazon EC2 (cc2.8xlarge)       Intel Xeon E5-2670 Sandy Bridge         2      8               2.60 GHz      88 EC2 CUs

Table 3 Hardware Specifications



If $n \gg |Q|$, or if $I_{max,r}$ is computed off-line, the additional term can be neglected. Even when computing
$I_{max,r}$ on-line, for all considered cases the approach with reduced sets of potential initial states showed
better performance.

Our method is capable of utilizing processors of different processing capacities, which is relevant for
heterogeneous multiprocessors and for cloud computing environments. Processor weights $w_i$
encode the processors' computational power. Because we employ weights to calculate chunk sizes for
processors, we encode different processing capacities in the size of each processor's chunk. If we do not
apply weights for processors of different processing capacities, the following equation describes the
overall time complexity,

$$ O\!\left( \frac{n \cdot m}{m + p - 1} \right), $$

where $p = |P| \times w_{worst}$, $w_{worst} = \min(w_0, w_1, \ldots, w_{|P|-1})$, and $m$ is either $|Q|$ or $I_{max,r}$.



5 Implementation 

We implemented our speculative DFA matching algorithms for the three architectures summarized
in Table 3. For our shared-memory multicore architecture implementation we were granted access
to the Intel Manycore Testing Lab (Intel MTL, [21]), which is an experimental environment of
non-commercial, 40-core nodes provided by Intel mainly for educational purposes. POSIX threads [10] were
used to parallelize DFA matching across multiple cores. To vectorize our speculative DFA matching
algorithm, we employed version 2 of the Advanced Vector Extensions (AVX2) of the forthcoming Intel
Haswell CPU architecture [20]. The AVX2 instruction set provides 256-bit registers enabling 8-fold
vectorization on 32-bit integer and single-precision floating-point data types. AVX2 is the first x86
instruction set extension to provide a gather operation for vectorized indexed read operations from
memory (vectorized register-indirect addressing). To the best of our knowledge, we are the first to
utilize gather operations to vectorize DFA matching. Because the Haswell architecture is scheduled to
be released in 2013, there is no processor available yet that supports AVX2 instructions. Hence, we
used Intel's Software Development Emulator (SDE, [22]) to emulate AVX2 instructions. To evaluate
our approach in a cloud computing environment, we employed m2.4xlarge and cc2.8xlarge instances of
the Amazon EC2 elastic computing cloud³. Each EC2 instance provides a nominal dedicated compute
capacity stated in Amazon's proprietary Compute Unit (CU) measure. Hardware specifications of the
Amazon EC2 instance types (nodes) used are given in Table 3. For our experiments, we employed
20 instances with a total of 320 physical cores. For communication across threads, the MPI message
passing interface was used.

We tailored our DFA data structures to maximize performance and to utilize the AVX2 instruction
set, in particular the novel AVX2 32-bit gather operations. To generate minimal DFAs from regular
expressions, we use Grail+ [38,13], which is a formal language toolset for the manipulation and
application of regular expressions and automata. Our DFA matching framework reads DFAs and input
strings in Grail+ format and converts them to our framework's internal representation.

DFA transition tables are usually represented as 2-dimensional arrays, with one row for each state
and one column for each character $x \in \Sigma$. With our representation, 2-dimensional arrays are flattened

(c) SBase = { 1, 2,
              4, 3,
              1, 3,
              3, 4,
              4, 4 };

(d) Str = bababbababbababababa
    IBase = { 1,0,1,0,1,1,0,1,0,1,1,0,1,0,1,0,1,0,1,0 };

Fig. 6 Example DFA (a), Grail+ format (b), SBase 1-dimensional transition table representation (c), and representation
of the DFA input (d)



into consecutive, 1-dimensional arrays. This representation allows storing multiple DFAs of different
alphabet sizes, and it facilitates application of AVX2 gather operations (gather operations allow
1-dimensional indexed reads only). Figure 6(a) shows our running example DFA from Figure 2 and
the DFA's Grail+ format (Figure 6(b)). Our transition table representation is given in C-like pseudocode
in Figure 6(c). Grail+ encodes DFA states as integers. Lines in Grail+ format represent triples
(source-state, transition-label, target-state), with the start and accepting states indicated on separate
lines. Our DFA representation encodes states as row indexes into the DFA transition table. Note that
State 4 represents the error state $q_e$. Row indexes are calculated relative to the base address SBase of
the array. In case of a second DFA stored after the running example, the second DFA's row indexes
will also be stored relative to SBase. For the input string, we introduce a 1-dimensional array IBase of
integers. For example, in Figure 6(d), character a is mapped to the value 0, and character b is mapped
to 1. Multiple DFA input strings may be concatenated in array IBase. Generation of this DFA and
input string representation can be trivially implemented while parsing the Grail+ DFA input data.
Our representation allows running a single DFA simultaneously on multiple input strings, or matching
multiple DFAs on one or more input strings.
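The conversion into this flattened representation can be sketched as follows. Two assumptions are made that the text does not spell out: table entries are stored as row offsets (state number times alphabet size) so that a lookup needs only an addition, and input characters are mapped to symbol indices by subtracting 'a'. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Turn a row-major table of state numbers into offset form: each entry
 * becomes the row offset (state * nsym) of its target state. */
void premultiply_table(const int *delta, size_t nstates, size_t nsym, int *sbase)
{
    for (size_t i = 0; i < nstates * nsym; i++)
        sbase[i] = delta[i] * (int)nsym;
}

/* Map characters 'a', 'b', ... to symbol indices 0, 1, ... as in Fig. 6(d).
 * Returns the number of symbols written to ibase. */
size_t encode_input(const char *str, int *ibase)
{
    size_t n = 0;
    for (; str[n] != '\0'; n++) {
        assert(str[n] >= 'a');
        ibase[n] = str[n] - 'a';
    }
    return n;
}
```

Applied to the string of Fig. 6(d), `encode_input` reproduces the IBase array shown there; `premultiply_table` turns the entries of Fig. 6(c) into the offsets 2, 4, 8, 6, and so on.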

Listing 1 Baseline matching routine in C for a possible initial state of a chunk

// Get address of first and last character of chunk:
INPUT_T* curPtr = &IBase[StartPos];
INPUT_T* endPtr = &IBase[EndPos];

// Get starting state and perform matching:
STATE_T CurrentState = InitialState * NrSymbols;
for (; curPtr != endPtr; curPtr++) {
    CurrentState = SBase[CurrentState + *curPtr];
}



Listing 1 shows how a chunk is matched for one possible initial state on multicore architectures.
It should be noted that by encoding the transition table's DFA states as offsets relative to the SBase
base address, 2-dimensional table lookups of conventional DFA representations are simplified to a
1-dimensional lookup that avoids the row-times-column multiplication of 2-dimensional arrays; with our
representation, we only add the current state's offset to the current input symbol (line 8 of Listing 1).
We employ pointers to access the input and to detect loop termination, thereby avoiding the need for
maintaining a separate loop counter variable. When compiled to x86-64, this matching loop consists of
only two add operations, one comparison, one indexed load and one conditional jump, which compares
favorably to Grail+'s matching loop implemented in C++, which requires more than an order of
magnitude more instructions for the same purpose. We used a variant of Listing 1 for sequential DFA
matching, as an efficient yardstick for our comparisons to the parallelized matching algorithms.
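A self-contained variant of the baseline loop is sketched below for experimentation. The typedefs, the function wrapper and the offset-form table are assumptions: the table is the one from Fig. 6(c) with every entry multiplied by the alphabet size of 2, which is what the `CurrentState + *curPtr` lookup requires.

```c
#include <assert.h>
#include <stddef.h>

typedef int INPUT_T;   /* assumed element types; the listing leaves them open */
typedef int STATE_T;

/* Match IBase[StartPos..EndPos) from one initial state and return the final
 * state as a row offset into SBase (state number * NrSymbols). */
STATE_T match_from_state(const STATE_T *SBase, const INPUT_T *IBase,
                         size_t StartPos, size_t EndPos,
                         STATE_T InitialState, STATE_T NrSymbols)
{
    const INPUT_T *curPtr = &IBase[StartPos];
    const INPUT_T *endPtr = &IBase[EndPos];   /* one past the last symbol */
    STATE_T CurrentState = InitialState * NrSymbols;
    assert(StartPos <= EndPos);
    for (; curPtr != endPtr; curPtr++)
        CurrentState = SBase[CurrentState + *curPtr];
    return CurrentState;
}

/* Example table in offset form (state * 2), derived from Fig. 6(c). */
const STATE_T ExampleSBase[] = { 2, 4,  8, 6,  2, 6,  6, 8,  8, 8 };
```

Matching the input "bab" (symbols 1, 0, 1) from state 0 visits states 2, 1 and 3, so the function returns offset 6; dividing by the alphabet size recovers the state number.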




Fig. 7 Hierarchical merging of $\mathcal{L}$-vectors (intra-node communication followed by inter-node communication) to
reduce message delay and variability on EC2. The number of available processing cores is denoted by $|P|$, and the
number of cores allocated per node is denoted by $|C|$. One core per node is left unallocated, to avoid performance
degradation with hypervised EC2 nodes.

5.1 Vectorized DFA matching using AVX2 instruction set extensions 

Listing 2 Vectorized DFA matching of chunks using AVX2 intrinsics

int i;
__m256i InpSyms, Ones = _mm256_set1_epi32(1);

// Load initial indices into SBase and IBase arrays:
__m256i States = _mm256_load_si256((__m256i const *) CStatesInit);
__m256i InpIdx = _mm256_load_si256((__m256i const *) CInputInit);
for (i = ChunkLength; i > 0; i--) {
    InpSyms = _mm256_i32gather_epi32(IBase, InpIdx, 4);  // Load input characters from IBase, indexed by InpIdx
    States = _mm256_add_epi32(States, InpSyms);          // Calculate indices of next states
    States = _mm256_i32gather_epi32(SBase, States, 4);   // Load next state values from SBase, indexed by States
    InpIdx = _mm256_add_epi32(InpIdx, Ones);             // Increase input indices by one
}

Listing 2 shows our core matching loop with 8-fold vectorization employing AVX2 vector
instruction intrinsics [20]. Data type __m256i represents an 8-way vector containing eight 32-bit int variables.
Variables States and InpIdx contain the indices into the state transition table SBase and the input
array IBase. They are initialized to precomputed starting positions of chunks in lines 5 and 6. We
use the _mm256_i32gather_epi32 intrinsic to perform vectorized, indexed loads from the SBase and
IBase arrays. For example, in line 8, eight input characters are loaded from IBase. Note that the offsets in
vector InpIdx are scaled by a factor of 4 (the third argument of the intrinsic), to account for the 32-bit
size of type int. For further details on the intrinsics used, we refer to [20]. We count the
loop index variable down instead of up because the decrement instruction will already set the x86
CPU's sign flag when we cross zero. This way we save a cmp instruction, which yields an additional 12%
performance improvement. Neither GCC nor Intel's ICC managed to generate optimal assembly code
from Listing 2, which required us to use inline assembly instead. Auto-vectorization of sequential DFA
matching is out of reach for compilers, because of the dependencies between current and next DFA
state.
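For readers without AVX2 hardware, the data flow of the vectorized loop can be followed with a scalar emulation. The sketch below is not AVX2 code: it replaces each gather by an ordinary indexed load per lane, and it assumes the offset-form table (entries are state number times alphabet size). The function name and the `LANES` constant are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8   /* one lane per 32-bit element of a 256-bit register */

/* Scalar emulation of the vectorized matching loop: each lane holds a row
 * offset into SBase and an index into IBase; all lanes advance in lock-step
 * for ChunkLength symbols. */
void match_lanes(const int *SBase, const int *IBase,
                 int States[LANES], int InpIdx[LANES], int ChunkLength)
{
    assert(ChunkLength >= 0);
    for (int i = ChunkLength; i > 0; i--) {
        for (int lane = 0; lane < LANES; lane++) {
            int sym = IBase[InpIdx[lane]];             /* gather from IBase */
            States[lane] = SBase[States[lane] + sym];  /* add, then gather from SBase */
            InpIdx[lane] += 1;                         /* advance the input index */
        }
    }
}
```

The lanes are independent of each other, which is what makes the loop vectorizable across chunks and initial states even though each individual lane carries a state-to-state dependency.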

5.2 DFA Matching on Cloud Computing Architectures 

With our implementation for the EC2 cloud computing environment, we employed MPI-based
message-passing communication between cores, mainly for merging $\mathcal{L}$-vectors. The chosen MPI
implementation was MPI-CH2 pQ. As mentioned briefly in previous parts, parallel reduction based on 
binary trees did not achieve satisfactory performance. We found the transfer times [26] of
messages between EC2 nodes too high to make binary reduction profitable. For example, the average
inter-node transfer time for a single $\mathcal{L}$-vector was 362 microseconds, with a standard deviation of 3.6%.
In comparison, the same intra-node message would take on average only 2.68 microseconds, with a
standard deviation of 0.14%. This observation is in line with a recent study that reports large delay
variations and unstable network throughput for the EC2 cloud [46].

To account for the message delay and variations on EC2, we employed a variant of parallel reduction
that is hierarchical with respect to intra-node and inter-node communication. This 2-tier merging
approach is based on the observation that intra-node messages showed substantially lower message
transfer times and variations than inter-node communication. Our reduction proceeds in two steps, as
depicted in Figure 7. In the first step, $\mathcal{L}$-vectors are merged locally by a designated node leader. In the
second step, node leaders send their $\mathcal{L}$-vectors to the master process, which combines them to compute
the overall matching result. Without loss of generality, this 2-step merging scheme requires that on each
EC2 node, DFA-matching worker processes are allocated to adjacent chunks. Our worker-to-node
allocation scheme is parameterized by the number of cores to utilize per node, denoted by $|C|$. For
reasons explained below, we leave one core unallocated per EC2 node. Figure 7 depicts computation of
$\mathcal{L}$-vectors by workers (for one chunk), node leaders (the combined map over all chunks matched on a
node) and the master (the overall map from the first to the last chunk). Unallocated cores are denoted
by the symbol "o".
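The merging step performed by node leaders and by the master can be viewed as composing state maps. The sketch below assumes each vector is stored as a plain array indexed by state number, with `first` belonging to the earlier chunk and `second` to the chunk that follows it; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Compose the state maps of two adjacent chunks: starting in state j,
 * the earlier chunk leads to first[j], and the later chunk continues
 * from there, so out[j] = second[first[j]]. */
void merge_lvectors(const int *first, const int *second, int *out, size_t nstates)
{
    for (size_t j = 0; j < nstates; j++) {
        assert(first[j] >= 0 && (size_t)first[j] < nstates);
        out[j] = second[first[j]];
    }
}
```

Because map composition is associative, adjacent vectors can be merged within a node first and across nodes afterwards without changing the result. With reduced initial state sets, only the entries of `second` that belong to its initial state set carry matching results; the reverse lookahead symbol ensures that the state actually reached by the earlier chunk is among them or is the error state.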

Our two-tier merging scheme outperformed parallel binary reduction and sequential merging for 
even the largest EC2 clusters (i.e., up to 20 nodes, which is the maximum possible EC2 cluster size). We 
found MPI messages among processes on the same node to show both low latency and low variability. 
We conjecture that MPICH2 applies shared-memory message passing optimizations similar to [24] for
node-local communication. Moreover, node-local communication is free from delay variations induced 
by the network that connects nodes. Therefore, with our merging scheme the only communication step 
subjected to EC2's message variability is the merging step conducted by the master. This compares 
favorably to any parallel reduction scheme with more than one reduction step involving inter-node com- 
munication, because each such reduction step may suffer from message delays caused by the underlying 
network. 

As mentioned above, we deliberately left one core per EC2 node unallocated. We observed that
without sacrificing one core per EC2 node, there was a high probability that one of the workers
on each node would experience a matching performance about one order of magnitude lower than
the workers on the remaining cores. This performance degradation did not affect the offline profiling
step, for which we took the median of a series of partial matching runs. However, it appeared
randomly during DFA matching. Because we could not reproduce this problem
on a local cluster of Linux computers, we attribute this performance degradation to EC2 hypervisor
activities that occasionally preempted the execution of one arbitrary worker thread per node. Leaving
one core unallocated on EC2 eliminated this problem. Given the increasing numbers of cores per CPU,
leaving one core unallocated can be considered an increasingly small sacrifice (e.g., our experiments
were conducted with EC2 nodes providing 8 and 16 cores, respectively).



6 Experimental Results 

We conducted experiments for both our basic and optimized speculative matching algorithms and
the algorithm presented in [18]. We employed 299 regular expressions from the PCRE library [35]
and 110 protein patterns from the PROSITE protein database [36]. Protein patterns were selected as
an example for the application domain of DNA sequence analysis. We compared our algorithms to
the baseline sequential DFA matching algorithm from Section 5 and to the currently used matching
engine that comes with PROSITE. All PCRE regular expressions and PROSITE protein patterns were
converted to unique minimum DFAs using Grail+ [38,13]. All experiments except the experiments on
EC2 were conducted with inputs of one million characters. Because we employed up to 288 cores
on EC2, the problem size of one million characters turned out to be too small for precise performance
measurements. Thus we used inputs of 8 million characters on EC2. Note that for increased readability
we represent speed-downs by negative values instead of fractional values. For example, the conventional
notation for a 2x speed-down is 1/2, but we use -2.

Fig. 8 Speedup of our algorithms on the Intel MTL



Figure 8 shows the results of our speculative parallel DFA membership test with and without
applying one-symbol reverse lookahead, for the PROSITE and PCRE benchmark suites conducted on
the Intel MTL. We used GCC 4.5.1 on RedHat RHEL 5.4 (x86_64 kernel version 2.6.18-164.el5). The x-axes
denote the number of states $|Q|$, and the y-axes denote the speedup over sequential matching. We note
the following three observations: (1) Our algorithms always show better performance than sequential
matching, despite the overhead due to parallelization. (The red horizontal lines denote the
break-even point where the speedup over sequential matching is 1.) The fact that there are no speed-downs
validates the failure-freedom of our speculative parallelization. (2) Speedups are always proportional
to $|P|$, as predicted by the complexity analysis in Section 4.4. This supports our basic assumption that
the number of symbols to be processed per processor decides the overall matching time despite the
overhead due to parallelization. (3) The larger speedups shown in Figure 8(b) and Figure 8(d) compared
to Figure 8(a) and Figure 8(c) result from our $I_{max}$ optimization, which reduces the number of initial
states to be matched per chunk.

This result compares favorably to the approach presented in [18], which has a complexity of $O(|Q| \cdot n / |P|)$
and thus achieves speedups only if the number of processors is larger than the number of states. We
evaluated the approach from [18] for both PCRE and the PROSITE patterns, as depicted in Figure 9.
In line with the algorithm's complexity results, the previous approach cannot achieve speedups when
$|P| < |Q|$. We observed an almost 390x speed-down for the largest DFA that we tested, which has
766 states. In contrast, our algorithm achieved a speedup between 1.6x and 38.2x for PCRE, and
between 1.2x and 13.9x for PROSITE.

Another experiment conducted on the MTL is the comparison to ScanProsite [12,40], which is the
reference implementation from the PROSITE protein database. ScanProsite is used to detect signature
matches in protein sequences. The tool is implemented in Perl; it can be used to find all substrings that
match a certain PROSITE pattern. We parameterized ScanProsite to find only one match to compare
with our optimized DFA matching algorithm, which determines whether an input string contains a
certain pattern or not. For a second comparison, we employed the UNIX grep utility with ScanProsite.
Grep constructs a DFA and uses the Boyer-Moore algorithm for matching [15]; it is faster than Perl,
which uses backtracking [11]. As shown in Figure 10, our algorithm using one-symbol reverse lookahead
is 410.8 to 7781.3 times faster than ScanProsite, and 45.6 to 6665.8 times faster than the UNIX grep
utility.

Fig. 10 Performance of our approach compared to ScanProsite (a) and the UNIX grep utility (b)



6.1 Performance of Vectorized DFA Matching Using AVX2 Instruction Set Extensions 



Figure 11 shows the performance improvements achieved by 8-fold vectorization using AVX2 instructions.
Figure 11(a) and Figure 11(c) show the results of the experiment from Section 5, but this time
conducted on the SDE emulator. Likewise, Figures 11(b) and 11(d) show the results of the vectorized
DFA membership tests on the SDE emulator. For compilation of code with AVX2 intrinsics, we used
ICC version 12.1.4. Version 4.46.0 of the SDE emulator was used. Because no cycle-accurate information
is provided by SDE, we used the instruction counts provided by SDE to determine speedups. Our
experiments show that 8-fold vectorization using AVX2 instructions achieved a 4.45x speedup over
scalar code. Furthermore, we observed the following. (1) An 8-core machine with AVX2 achieved performance
comparable to a 40-core machine on the MTL. The speedups range from 1.2x to 35.7x for PCRE
and 0.7x to 13.2x for PROSITE. (2) Speedup is again proportional to |P|, showing that vectorization
is in line with our complexity analysis from Section 4.4. (3) We observed a 16.0% speed-down
on average (maximum 31.5%) with very large DFAs due to the overhead of our parallelization for
SIMD operations. This speed-down is not innate to the algorithms, but due to our implementation, in
particular the way chunks are allocated to SIMD vector units. The speed-down can be overcome by
increasing the problem size (which we refrained from, to keep experiments consistent).
Fig. 11 Speedup of AVX2 8-fold vectorization over the optimized matching algorithm
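
To illustrate why a gather operation allows the matching loop to be vectorized, the following minimal Python
sketch emulates SIMD lanes that advance in lockstep, one chunk per lane. It is an illustration under
stated assumptions only, not the AVX2 implementation evaluated above: the example DFA, the flattened
table layout, and the names `TABLE`, `gather` and `match_lanes` are hypothetical, and the list
comprehension merely stands in for a hardware gather (an indexed load per lane).

```python
# Hypothetical example DFA over {a, b} that accepts strings containing "ab".
# TABLE is a flattened |Q| x |Sigma| transition table; state 2 is accepting.
SIGMA = {"a": 0, "b": 1}
NUM_SYMBOLS = len(SIGMA)
TABLE = [1, 0,   # state 0: a -> 1, b -> 0
         1, 2,   # state 1: a -> 1, b -> 2
         2, 2]   # state 2: a -> 2, b -> 2


def gather(table, indices):
    """Emulate a SIMD gather: one indexed table load per lane."""
    return [table[i] for i in indices]


def match_lanes(chunks, start_states):
    """Advance all lanes in lockstep; every lane holds an equally long chunk."""
    length = len(chunks[0])
    if any(len(chunk) != length for chunk in chunks):
        raise ValueError("all lanes must process chunks of equal length")
    states = list(start_states)
    for pos in range(length):
        # Per-lane index computation (a vector multiply-add in real SIMD code),
        # followed by a single gather that fetches all successor states.
        indices = [s * NUM_SYMBOLS + SIGMA[chunk[pos]]
                   for s, chunk in zip(states, chunks)]
        states = gather(TABLE, indices)
    return states


# Four lanes, each starting in state 0.
final_states = match_lanes(["aab", "bbb", "bab", "abb"], [0, 0, 0, 0])
```

Because every lane executes the same instruction sequence and differs only in the index it loads, the
loop body contains no lane-specific control flow, which is the property a fully vectorized matcher needs.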



6.2 DFA Matching Performance on Cloud Computing Architectures 

We conducted experiments on the Amazon EC2 elastic computing cloud to determine the performance 
of our speculative DFA matching algorithms on distributed-memory architectures, employing up to 
20 nodes and 288 cores. We explored the adaptation of our load-balancing approach to EC2 nodes 
of varying processing capacities. For the convenience of operating a cluster of EC2 nodes, we used 
StarCluster [35], an open source cluster-computing toolkit for EC2. 

Experiments were conducted on up to 20 cc2.8xlarge EC2 instances, which provide 16 cores per
node. We again employed one symbol reverse lookahead with our approach. For reasons discussed in
Section 5.2, we occupied 15 out of 16 cores, resulting in 300 cores in total. For better presentation,
Figure 12 shows our experimental results with cluster sizes that are a multiple of 32 cores. We used the
MPICH2 1 MPI implementation, which we found to provide higher performance on EC2 than Open-MPI [32].
We found the communication costs between nodes to be an important factor on the EC2 cloud. We
instrumented our matching framework to determine the communication overhead. The panels of Figure 12
show speedups both without and with communication cost taken into account. Figure 13 depicts the
ratio of time spent for communication to overall execution time. Although the graphs shown in Figure 13
are irregular due to the instability of the EC2 network, we can observe that the communication costs
increase as the number of processors grows. The communication cost decreases as |Q| grows, which follows
from the fact that the required matching time increases with |Q|, which de-emphasizes communication
costs. This observation explains why PCRE benchmarks, which show smaller I_max constants (i.e., smaller
sets of possible initial states), are more impacted by communication overhead than PROSITE benchmarks.

Fig. 12 Speedup of I_max optimization algorithm on cloud computers (cc2.8xlarge instance type on EC2)
Fig. 13 Proportional overhead of MPI communication cost on cloud computers (cc2.8xlarge instance type on EC2)

The goal of our load-balancing mechanism is to determine chunk sizes such that all processing cores
are utilized equally, i.e., take equally long for matching their assigned chunk. Processor capacities are
incorporated in the form of weights (see Eq. (7)). To evaluate the load-balance achieved with our
speculative DFA matching computations, we used two different types of Amazon EC2 instances, namely
cc2.8xlarge (denoted as "Fast" in Table 4) and m2.4xlarge (denoted as "Slow" in the second column of
the table). Although the clock frequencies of these EC2 instance types do not differ much (see Table 3),
the difference in processor capacities is observable. We found the ratio of actual processing capacity
of cc2.8xlarge compared to m2.4xlarge to be 1.41 on average, meaning that cc2.8xlarge on average
computes 41% faster than m2.4xlarge. For this experiment, we allocated inhomogeneous clusters consisting
of various numbers of cc2.8xlarge and m2.4xlarge instances. To get an indication of the effectiveness
of our load-balancing scheme, we determined the standard deviations of DFA matching times across
all cores of such inhomogeneous EC2 clusters. A balanced load would then be indicated by standard
deviations close to zero. E.g., the experiment from row 5 was conducted on a mix of four cc2.8xlarge
instances and one m2.4xlarge instance. The maximum observed standard deviation of execution times
was 7.0%, with 0.6% minimum standard deviation and 1.3% average across the PROSITE benchmark
suite. During experiments, we noticed that capacities of cluster nodes could change slightly across
cluster invocations, making the re-estimation of processor capacities necessary at cluster startup time.
(This is in line with the findings from [42] on performance unpredictability of cloud computing
environments.) Hence the adaptability of our load-balancing scheme with respect to processor capacities
is essential on cloud computing environments. Our observed proportional standard deviations of execution
times are very low, around 1% on average, as shown in Table 4. In particular, the presented load-balancing
scheme adapts well to different configurations of inhomogeneous clusters.

EC2 Instances    PROSITE                       PCRE
Fast    Slow     Min.      Avg.      Max.      Min.      Avg.      Max.
-       5        0.0036    0.0102    0.0298    0.0046    0.0149    0.0696
1       4        0.0031    0.0086    0.0360    0.0036    0.0108    0.0355
2       3        0.0033    0.0090    0.0275    0.0062    0.0121    0.0427
3       2        0.0051    0.0116    0.0248    0.0083    0.0186    0.0707
4       1        0.0060    0.0130    0.0700    0.0093    0.0194    0.0707
5       -        0.0056    0.0119    0.0305    0.0095    0.0188    0.0412

Table 4 Effectiveness of the load-balancing scheme on six configurations of inhomogeneous clusters consisting of two
types of Amazon EC2 instances, m2.4xlarge and cc2.8xlarge
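
The idea of weight-based chunk sizing can be sketched as follows. This is a simplified illustration,
not the formula from Eq. (7): it assumes that matching time is proportional to chunk length divided by
processor capacity, and it ignores that speculatively matched chunks cost more per symbol than the
first chunk. The function name `chunk_sizes` and the use of integer weights (capacities in percent,
here 141 for cc2.8xlarge and 100 for m2.4xlarge, following the measured ratio of 1.41) are assumptions
made for the example.

```python
def chunk_sizes(n, weights):
    """Split n input symbols into chunk sizes proportional to integer weights."""
    total = sum(weights)
    # Floor of each processor's proportional share (exact integer arithmetic).
    sizes = [n * w // total for w in weights]
    # Rounding down loses fewer than len(weights) symbols in total;
    # hand them out one by one, highest-capacity processors first.
    leftover = n - sum(sizes)
    by_capacity = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    for i in by_capacity[:leftover]:
        sizes[i] += 1
    return sizes


# Two "Fast" cores and one "Slow" core sharing an input of 1000 symbols.
sizes = chunk_sizes(1000, [141, 141, 100])
# Estimated relative matching time per core (chunk size divided by capacity).
times = [size / weight for size, weight in zip(sizes, [141, 141, 100])]
```

With such a split, the estimated per-core matching times differ only by rounding effects, which is the
situation that standard deviations close to zero in Table 4 indicate.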



7 Related Work 

Locating a string in a larger text has applications in text editing, compiler front-ends and web
browsers, internet search engines, computer security, and DNA sequence analysis. Early string searching 
algorithms such as Aho-Corasick [2], Boyer-Moore [7] and Rabin-Karp [35] efficiently match a finite 
set of input strings against an input text. 

Regular expressions allow the specification of infinite sets of input strings. Converting a regular 
expression to a DFA for DFA membership tests is a standard technique to perform regular expression 
matching. The specification of virus signatures in intrusion prevention systems [8,44,39] and the
specification of DNA sequences [431IB] constitute recent applications of regular expression matching with
DFAs.

Considerable research effort has been spent on parallel algorithms for DFA membership tests.
Ladner et al. [37] applied the parallel prefix computation for DFA membership tests with Mealy
machines. Hillis and Steele [17] applied parallel prefix computations for DFA membership tests on the
65,536 processor Connection Machine. Ravikumar's survey [37] shows how DFA membership tests can
be stated as a chained product of matrices. Because of the underlying parallel prefix computation,
all three approaches perform a DFA membership test on input size n in O(log(n)) steps, requiring
n processors. Their algorithms handle arbitrary regular expressions, but the underlying assumption
of a massive number of available processors can hardly be met in most practical settings. Misra [32]
derived another O(log(n)) string matching algorithm. The number of required processors is on the
order of the product of the two string lengths and hence not practical.
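
The parallel prefix formulation rests on the observation that each input symbol induces a mapping from
states to states, and that composing such mappings is associative. The Python sketch below illustrates
this with a pairwise (tree-shaped) reduction; it is a sequential emulation under assumed names
(`DELTA`, `compose`, `tree_reduce`, `accepts`) and an assumed example DFA, not the algorithm of any of
the cited works. Each pass of the `while` loop corresponds to one parallel step, so about log2(n) passes
are needed for n symbols.

```python
# Hypothetical DFA with states 0..2 that accepts strings over {a, b} containing "ab".
# DELTA[c][q] is the successor of state q on symbol c.
DELTA = {"a": (1, 1, 2), "b": (0, 2, 2)}


def compose(f, g):
    """Return the state mapping 'first f, then g'; composition is associative."""
    return tuple(g[q] for q in f)


def tree_reduce(maps):
    """Combine neighboring mappings pairwise; each pass could run in parallel."""
    while len(maps) > 1:
        paired = [compose(maps[i], maps[i + 1]) for i in range(0, len(maps) - 1, 2)]
        if len(maps) % 2:
            paired.append(maps[-1])  # an unpaired last mapping moves up unchanged
        maps = paired
    return maps[0]


def accepts(word, start=0, accepting=(2,)):
    """Membership test via reduction of per-symbol transition mappings."""
    maps = [DELTA[c] for c in word]
    final = tree_reduce(maps)[start] if maps else start
    return final in accepting
```

Representing each mapping as a 0/1 matrix and replacing `compose` by matrix multiplication yields the
chained matrix product view mentioned above.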

A straightforward way to exploit parallelism with DFA membership tests is to run a single DFA
on multiple input streams in parallel, or to run multiple DFAs in parallel. This approach has been 
taken by Scarpazza et al. [41] with a DFA-based string matching system for network security on 
the IBM Cell BE processor. Similarly, Wang et al. [47] investigated parallel architectures for packet 
inspection based on DFAs. Both approaches assume multiple input streams and a vast number of 
patterns (i.e., virus signatures), which is common with network security applications. However, neither 
approach parallelizes the DFA membership algorithm itself, which is required to improve applications 
with single, long-running membership tests such as DNA sequence analysis. 

Scarpazza et al. utilize the SIMD units of the Cell BE's synergistic processing units to match 
multiple input streams in parallel. However, their vectorized DFA matching algorithm contains several 
SISD instructions and the reported speedup from 16-way vectorization is only a factor of 2.51. In 
contrast, our proposed 8-way vectorized DFA membership test avoids SISD instructions, achieving a 
speedup of 4.45 over the sequential version. 

Recent research efforts focused on speculative computations to parallelize DFA membership tests.
Holub and Stekr [18] were the first to split the input string into chunks and distribute chunks among
available processors. Their speculation introduces a substantial amount of redundant computation,
which restricts the obtainable speedup for general DFAs to O(|P|/|Q|), where |P| is the number of
processors, and |Q| is the number of DFA states. Their algorithm degenerates to a speed-down when |Q|
exceeds the number of processors (see also Section [BJ Figure [S]). To overcome this problem, Holub and
Stekr specialized their algorithm for k-local DFAs. A DFA is k-local if for every word w of length k and
for all states p, q ∈ Q it holds that δ*(p,w) = δ*(q,w). Starting the matching operation k symbols
ahead of a given chunk will synchronize the DFA into the correct initial state by the time matching
reaches the beginning of the chunk, which eliminates all speculative computation. Holub and Stekr
achieve a linear speedup of O(|P|) for k-local automata. Unlike Holub and Stekr's approach, our DFA
parallelization avoids speed-downs altogether. We use structural properties of general DFAs to limit
the amount of speculation. In particular, the restriction to k-local automata is not required. We have
vectorized our speculative matching routine, and we have extensively evaluated our approach on a
40-core shared memory architecture, for AVX2 vector instructions, and on the Amazon EC2 cloud
infrastructure.
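
The k-locality condition can be checked by brute force on small automata, as the following Python sketch
shows. It is an illustration only: the two example DFAs, the list-of-dictionaries encoding, and the
function names are assumptions, and the enumeration of all words of length k is exponential in k. The
helper `chunk_start_state` shows the warm-up idea, i.e., that for a k-local DFA the k symbols preceding a
chunk determine the chunk's initial state regardless of the state in which the warm-up begins.

```python
from itertools import product


def run(delta, state, word):
    """Extended transition function: the state reached from `state` on `word`."""
    for c in word:
        state = delta[state][c]
    return state


def is_k_local(delta, alphabet, k):
    """True if every word of length k drives all states into one common state."""
    states = range(len(delta))
    return all(len({run(delta, q, w) for q in states}) == 1
               for w in product(alphabet, repeat=k))


def chunk_start_state(delta, text, begin, k, warmup_state=0):
    """Initial state of the chunk starting at `begin`, found by a k-symbol warm-up."""
    return run(delta, warmup_state, text[begin - k:begin])


# Hypothetical DFA that only remembers the last symbol read (1-local).
LAST_SYMBOL = [{"a": 0, "b": 1}, {"a": 0, "b": 1}]
# Hypothetical DFA for "contains ab"; its absorbing state 2 prevents k-locality.
CONTAINS_AB = [{"a": 1, "b": 0}, {"a": 1, "b": 2}, {"a": 2, "b": 2}]
```

The second automaton shows why the restriction is severe: a DFA with an absorbing accepting state and at
least one other cycle is not k-local for any k.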

Jones et al. [23] reported that with the IE 8 and Firefox web browsers 3-40% of the execution
time is spent parsing HTML documents. To speed up browsing, Jones et al. employ speculation to
parallelize token detection (lexing) of HTML language front-ends. Similar to Holub and Stekr's k-local
automata, they use the preceding k characters of a chunk to synchronize a DFA to a particular state.
Unlike k-locality, which is a static DFA property, Jones et al. speculate the DFA to be in a particular,
frequently occurring DFA state at the beginning of a chunk. Speculation fails if the DFA turns out
to be in a different state, in which case the chunk needs to be re-matched. Lexing HTML documents
results in frequent matches, and the structure of regular expressions is reported to be simpler than,
e.g., virus signatures [30]. Speculation is facilitated by the fact that the state at the beginning of a
token is always the same, regardless of where lexing started. A prototype implementation is reported to
scale up to six of the eight synergistic processing units of the Cell BE.

The speculative parallel pattern matching (SPPM) approach by Luchaup et al. [30,29] uses speculation
to match the increasing network line-speeds faced by intrusion prevention systems. SPPM DFAs
represent virus signatures. Like Jones et al., DFAs are speculated to be in a particular, frequently 
occurring DFA state at the beginning of a chunk. SPPM starts the speculative matching at the begin- 
ning of each chunk. With every input character, a speculative matching process stores the encountered 
DFA state for subsequent reference. Speculation fails if the DFA turns out to be in a different state 
at the beginning of a speculatively matched chunk. In this case re-matching continues until the DFA 
synchronizes with the saved history state (in the worst case, the whole chunk needs to be re-matched).
A single-threaded SPPM version is proposed to improve performance by issuing multiple independent 
memory accesses in parallel. Such pipelining (or interleaving) of DFA matches is orthogonal to our 
approach, which focuses on latency rather than throughput. 
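
The history-based validation described above can be sketched in Python as follows. The sketch follows
only the description given in this section; it is not taken from the SPPM implementation, and the example
DFA and the names `run_with_history` and `validate` are assumptions. Since a DFA is deterministic,
re-matching can stop as soon as the re-matched state equals the saved state at the same input position,
because all later states must then agree as well.

```python
def run_with_history(delta, state, chunk):
    """Speculatively match `chunk`, recording the state after every symbol."""
    history = []
    for c in chunk:
        state = delta[state][c]
        history.append(state)
    return history


def validate(delta, true_start, guessed_start, chunk, history):
    """Return (final state of the chunk, number of symbols that were re-matched)."""
    if true_start == guessed_start:
        return (history[-1] if history else true_start), 0
    state = true_start
    for pos, c in enumerate(chunk):
        state = delta[state][c]
        if state == history[pos]:
            # Synchronized with the saved history: its remainder is valid.
            return history[-1], pos + 1
    return state, len(chunk)  # worst case: the whole chunk was re-matched


# Hypothetical DFA for "contains ab" over {a, b}; state 2 is accepting.
CONTAINS_AB = [{"a": 1, "b": 0}, {"a": 1, "b": 2}, {"a": 2, "b": 2}]

history = run_with_history(CONTAINS_AB, 0, "aab")    # speculation guesses state 0
result = validate(CONTAINS_AB, 1, 0, "aab", history)  # the true start state is 1
```

In this example the speculation fails, yet only one symbol has to be re-matched before the two runs
coincide; with a chunk such as "bba" the runs never meet and the entire chunk is re-matched.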

SPPM assumes all regular expressions to be suffix-closed, which is the common scenario with
intrusion prevention systems. A regular expression is suffix-closed if matching a given string w implies
that w followed by any suffix is matched, too. A suffix-closed regular language L has the property that
x ∈ L ⇔ ∀w ∈ Σ* : xw ∈ L.

Unlike SPPM and the approach by Jones et al., our speculative DFA matching approach does not 
rely on a heavily biased distribution of DFA state frequencies. Instead, we use static DFA properties 
to minimize speculative matching overhead. Our approach is not restricted to suffix-closed regular 
expressions, and our speculation does not rely on the common case being a match (Jones et al.), or 
the common case being a non-match (SPPM). To the best of our knowledge, we are the first to employ 
SIMD gather-operations to fully vectorize the DFA matching process. Our DFA membership test
provides a load-balancing mechanism for clusters and cloud computing environments. Unlike previous 
approaches, our speculative matching algorithm cannot result in a speed-down. We conducted an 
extensive experimental evaluation on a 40-core shared memory architecture, on a simulator for AVX2 
vector instructions, and on the Amazon EC2 cloud infrastructure. Our benchmarks consist of 299 
regular expressions from the PCRE library [35], and of 110 patterns from the PROSITE protein 
pattern database [43]. We analyzed the complexity of our speculative matching algorithm, and we
provide insight on achievable scalability on shared-memory and cloud-computing environments. This 
paper is the extended, journal version of a workshop presentation [9] and a technical report [5]. 
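
To make the contrast with failure-prone speculation concrete, the following Python sketch illustrates
failure-free speculative matching in its simplest form: every chunk except the first is matched for each
state in a set of possible initial states, and the per-chunk results are then combined in order. It is a
sequential emulation under stated assumptions, not the implementation evaluated in this paper. In
particular, deriving the candidate set from the single symbol preceding a chunk is our reading of one
symbol reverse lookahead for this example, and the example DFA and all function names are hypothetical.

```python
def possible_initial_states(delta, lookback_symbol):
    """States the DFA can be in right after reading `lookback_symbol`."""
    return {row[lookback_symbol] for row in delta}


def match_chunk(delta, chunk, starts):
    """Map every candidate start state to the state reached at the chunk's end."""
    result = {}
    for start in starts:
        state = start
        for c in chunk:
            state = delta[state][c]
        result[start] = state
    return result


def speculative_match(delta, text, num_chunks, start=0):
    """Final DFA state for `text`, computed chunk-wise and then combined."""
    if not text:
        return start
    size = -(-len(text) // num_chunks)  # ceiling division
    bounds = [(lo, min(lo + size, len(text))) for lo in range(0, len(text), size)]
    # Phase 1: chunks are independent of each other and could be matched in parallel.
    partial = []
    for lo, hi in bounds:
        starts = {start} if lo == 0 else possible_initial_states(delta, text[lo - 1])
        partial.append(match_chunk(delta, text[lo:hi], starts))
    # Phase 2: combine; the true start state of a chunk is always among its candidates.
    state = start
    for mapping in partial:
        state = mapping[state]
    return state


# Hypothetical DFA for "contains ab" over {a, b}; state 2 is accepting.
CONTAINS_AB = [{"a": 1, "b": 0}, {"a": 1, "b": 2}, {"a": 2, "b": 2}]
final_state = speculative_match(CONTAINS_AB, "bbabba", num_chunks=3)
```

No chunk is ever re-matched in this scheme, so sequential semantics are preserved by construction; the
price is the redundant work of matching each chunk once per candidate state, which is why small candidate
sets matter.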

8 Conclusions 

We have presented a speculative DFA pattern matching method for shared-memory, SIMD and cloud 
computing environments. Our parallel matching algorithm exploits structural DFA properties to mini- 
mize the speculative overhead. To the best of our knowledge, this is the first speculative DFA matching 
approach that is failure-free, i.e., (1) it maintains sequential semantics, and (2) it avoids speed-downs 
altogether. On architectures with a SIMD gather-operation for indexed memory loads, our matching 
operation is fully vectorized. Communication patterns tailored to the characteristics of cloud
computing environments are provided. The proposed load-balancing scheme uses an off-line profiling step
to determine the matching capacity of each participating processor. Based on matching capacities, 
DFA matches are load-balanced on inhomogeneous parallel architectures. We have shown that our al- 
gorithms have a better time complexity than previous work. We conducted an extensive experimental 
evaluation on the PCRE and PROSITE benchmarks on a 4 CPU (40 cores) shared-memory node of 
the Intel Manycore Testing Lab (Intel MTL), on the Intel AVX2 SDE simulator for 8-way fully vec- 
torized SIMD execution, and on a 20-node (288 cores) cluster on the Amazon EC2 computing cloud. 
Our results predict that speculative parallel DFA matching can produce substantial speedups. Unlike 
previous methods, our technique does not impose any restriction on the matched regular expressions. 